CS6140 Assignment 3: Regression Models Part 1¶
- Student: Sukhrobbek Ilyosbekov
Exploratory Data Analysis (EDA) on EPA Vehicle Emissions Data¶
In this notebook, exploratory data analysis was performed on the EPA vehicle emissions data from the years 2015 to 2024. The goal is to understand the data, identify patterns, and prepare it for regression analysis.
Table of Contents
- Initial Exploration and Summary Statistics
- Data Structure and Missing Values
- Comparing Column Names Across Years
- Distribution of Key Variables
- Correlation Analysis
- Combining Datasets
- Standardizing Continuous Variables
- Temporal Analysis
- Vehicle Manufacturer Analysis
- Creating a Clean Dataset for Regression
Setup¶
In order to run this notebook, we need to have the following libraries installed:
- pandas
- numpy
- matplotlib
- seaborn
- openpyxl
we can install these libraries using the following command:
pip install pandas numpy matplotlib seaborn openpyxl
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def get_data_path(relative_path: str) -> str:
"""
Get the absolute path to a file in the `dataset` directory.
Args:
relative_path: The path to the file relative to the `dataset` directory including the file name.
Returns:
The full path to the file in the `dataset` directory.
Examples:
>>> get_data_path("assignment1/boston_listings.csv")
'C:/Users/username/assignments/dataset/assignment1/boston_listings.csv'
"""
return os.path.abspath(os.path.join("../../dataset", relative_path))
1. Initial Exploration and Summary Statistics ¶
Load the datasets for all the years and compare the basic summary statistics.
# Define the file paths
csv_files = ["15-testcar.csv", "16-testcar.csv"]
xlsx_files = ["17-testcar.xlsx", "18-testcar.xlsx", "19-testcar.xlsx",
"20-testcar.xlsx", "21-testcar.xlsx", "22-testcar.xlsx",
"23-testcar.xlsx", "24-testcar.xlsx"]
# Load CSV files
dfs_csv: list[pd.DataFrame] = []
for file in csv_files:
file_path = get_data_path(f"assignment3/{file}")
df = pd.read_csv(file_path)
df["Year"] = "20" + file[:2] # Extract year from filename
dfs_csv.append(df)
# Load Excel files
dfs_excel: list[pd.DataFrame] = []
for file in xlsx_files:
file_path = get_data_path(f"assignment3/{file}")
df = pd.read_excel(file_path)
df["Year"] = "20" + file[:2] # Extract year from filename
dfs_excel.append(df)
# Add all dataframes to a single list
dfs = dfs_csv + dfs_excel
print("Number of datasets loaded:", len(dfs))
print("The shape of data for each dataset:")
for i, df in enumerate(dfs):
print(f"Dataset {df["Year"].iloc[0]}: {df.shape}")
print("Few rows of the first dataset:")
display(dfs[0].head())
print("Columns in the dataset:")
print(dfs[0].columns)
Number of datasets loaded: 10 The shape of data for each dataset: Dataset 2015: (4832, 68) Dataset 2016: (4650, 68) Dataset 2017: (4685, 68) Dataset 2018: (4727, 68) Dataset 2019: (4719, 68) Dataset 2020: (4450, 68) Dataset 2021: (4265, 68) Dataset 2022: (4493, 68) Dataset 2023: (4521, 68) Dataset 2024: (4137, 68) Few rows of the first dataset:
| Model Year | Vehicle Manufacturer Name | Veh Mfr Code | Represented Test Veh Make | Represented Test Veh Model | Test Vehicle ID | Test Veh Configuration # | Test Veh Displacement (L) | Actual Tested Testgroup | Vehicle Type | ... | Set Coef B (lbf/mph) | Set Coef C (lbf/mph**2) | Aftertreatment Device Cd | Aftertreatment Device Desc | Police - Emergency Vehicle? | Averaging Group ID | Averaging Weighting Factor | Averaging Method Cd | Averging Method Desc | Year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | aston martin | ASX | Aston Martin | DB9 | 143TT1042 | 0 | 5.9 | DASXV05.9VH1 | Car | ... | 0.2990 | 0.0192 | TWC | Three-way catalyst | N | NaN | NaN | N | No averaging | 2015 |
| 1 | 2015 | aston martin | ASX | Aston Martin | DB9 | 143TT1042 | 0 | 5.9 | DASXV05.9VH1 | Car | ... | 0.2990 | 0.0192 | TWC | Three-way catalyst | N | NaN | NaN | N | No averaging | 2015 |
| 2 | 2015 | aston martin | ASX | Aston Martin | Rapide S | 410TT4202 | 0 | 6.0 | FASXV06.0VHB | Car | ... | 2.1980 | 0.0280 | TWC | Three-way catalyst | N | NaN | NaN | N | No averaging | 2015 |
| 3 | 2015 | aston martin | ASX | Aston Martin | Rapide S | 410TT4202 | 0 | 6.0 | FASXV06.0VHB | Car | ... | 2.1980 | 0.0280 | TWC | Three-way catalyst | N | NaN | NaN | N | No averaging | 2015 |
| 4 | 2015 | aston martin | ASX | Aston Martin | V12 Vantage S | 288TT2087 | 0 | 6.0 | FASXV06.0VHB | Car | ... | 0.4063 | 0.0186 | TWC | Three-way catalyst | N | NaN | NaN | N | No averaging | 2015 |
5 rows × 68 columns
Columns in the dataset:
Index(['Model Year', 'Vehicle Manufacturer Name', 'Veh Mfr Code',
'Represented Test Veh Make', 'Represented Test Veh Model',
'Test Vehicle ID', 'Test Veh Configuration #',
'Test Veh Displacement (L)', 'Actual Tested Testgroup', 'Vehicle Type',
'Rated Horsepower', '# of Cylinders and Rotors', 'Engine Code',
'Tested Transmission Type Code', 'Tested Transmission Type',
'# of Gears', 'Transmission Lockup?', 'Drive System Code',
'Drive System Description', 'Transmission Overdrive Code',
'Transmission Overdrive Desc', 'Equivalent Test Weight (lbs.)',
'Axle Ratio', 'N/V Ratio', 'Shift Indicator Light Use Cd',
'Shift Indicator Light Use Desc', 'Test Number', 'Test Originator',
'Analytically Derived FE?', 'ADFE Test Number',
'ADFE Total Road Load HP', 'ADFE Equiv. Test Weight (lbs.)',
'ADFE N/V Ratio', 'Test Procedure Cd', 'Test Procedure Description',
'Test Fuel Type Cd', 'Test Fuel Type Description', 'Test Category',
'THC (g/mi)', 'CO (g/mi)', 'CO2 (g/mi)', 'NOx (g/mi)', 'PM (g/mi)',
'CH4 (g/mi)', 'N2O (g/mi)', 'RND_ADJ_FE', 'FE_UNIT', 'FE Bag 1',
'FE Bag 2', 'FE Bag 3', 'FE Bag 4', 'DT-Inertia Work Ratio Rating',
'DT-Absolute Speed Change Ratg', 'DT-Energy Economy Rating',
'Target Coef A (lbf)', 'Target Coef B (lbf/mph)',
'Target Coef C (lbf/mph**2)', 'Set Coef A (lbf)',
'Set Coef B (lbf/mph)', 'Set Coef C (lbf/mph**2)',
'Aftertreatment Device Cd', 'Aftertreatment Device Desc',
'Police - Emergency Vehicle?', 'Averaging Group ID',
'Averaging Weighting Factor', 'Averaging Method Cd',
'Averging Method Desc', 'Year'],
dtype='object')
There are total 10 datasets for the years 2015 to 2024. Loaded all datasets and stored them in the list dfs.
All datasets have the same 68 columns, but the number of rows varies across years and it is in range from 4000 to 5000.
# Calculate summary statistics for "Test Veh Displacement (L)" and "Set Coef A (lbf)" in each dataset
for df in dfs:
year = df["Year"].iloc[0]
print(f"Summary Statistics for Year {year}:")
print(df[["Test Veh Displacement (L)", "Set Coef A (lbf)"]].describe())
print("\n")
Summary Statistics for Year 2015:
Test Veh Displacement (L) Set Coef A (lbf)
count 4832.000000 4832.000000
mean 3.334225 12.015613
std 4.992816 10.256352
min 0.001000 -28.100000
25% 2.000000 6.119000
50% 3.000000 12.117000
75% 3.700000 19.010250
max 99.999000 150.200000
Summary Statistics for Year 2016:
Test Veh Displacement (L) Set Coef A (lbf)
count 4650.000000 4650.000000
mean 3.234766 12.148468
std 5.084566 9.498626
min 0.001000 -30.012000
25% 2.000000 6.342500
50% 2.967000 12.364000
75% 3.600000 18.840000
max 99.999000 60.600000
Summary Statistics for Year 2017:
Test Veh Displacement (L) Set Coef A (lbf)
count 4685.000000 4685.000000
mean 3.148157 12.998771
std 4.679822 10.118586
min 0.001000 -27.337000
25% 2.000000 6.500000
50% 2.500000 13.050000
75% 3.596000 19.230000
max 99.999000 60.600000
Summary Statistics for Year 2018:
Test Veh Displacement (L) Set Coef A (lbf)
count 4727.000000 4727.000000
mean 3.250181 12.971918
std 6.119035 10.339964
min 0.001000 -25.362000
25% 2.000000 6.210000
50% 2.500000 12.700000
75% 3.500000 19.110000
max 99.999000 60.600000
Summary Statistics for Year 2019:
Test Veh Displacement (L) Set Coef A (lbf)
count 4719.000000 4719.000000
mean 3.294054 12.706450
std 6.744411 10.764788
min 0.001000 -25.362000
25% 2.000000 6.330000
50% 2.500000 12.000000
75% 3.500000 18.246500
max 99.999000 64.520000
Summary Statistics for Year 2020:
Test Veh Displacement (L) Set Coef A (lbf)
count 4450.000000 4450.000000
mean 3.258976 12.070814
std 6.307946 11.749594
min 0.001000 -99.900000
25% 2.000000 5.800000
50% 2.500000 11.590000
75% 3.500000 18.360000
max 99.999000 64.520000
Summary Statistics for Year 2021:
Test Veh Displacement (L) Set Coef A (lbf)
count 4265.000000 4265.000000
mean 3.214475 10.474279
std 6.099970 12.183433
min 0.001000 -52.800000
25% 2.000000 4.900000
50% 2.500000 11.130000
75% 3.500000 17.340000
max 99.999000 84.700000
Summary Statistics for Year 2022:
Test Veh Displacement (L) Set Coef A (lbf)
count 4493.000000 4493.000000
mean 3.325435 8.709525
std 6.628138 11.929443
min 0.001000 -52.800000
25% 2.000000 2.698000
50% 2.500000 10.080000
75% 3.500000 15.917000
max 99.999000 58.000000
Summary Statistics for Year 2023:
Test Veh Displacement (L) Set Coef A (lbf)
count 4521.000000 4521.000000
mean 3.918982 8.997686
std 10.664489 11.619730
min 0.001000 -52.800000
25% 2.000000 2.320000
50% 2.500000 10.360000
75% 3.500000 16.288000
max 99.999000 52.160000
Summary Statistics for Year 2024:
Test Veh Displacement (L) Set Coef A (lbf)
count 4137.000000 4137.000000
mean 3.990515 7.130689
std 11.646181 12.743283
min 0.001000 -65.390000
25% 1.991000 -0.060000
50% 2.487000 8.824000
75% 3.456000 15.190000
max 99.999000 99.999000
The summary statistics, including the mean, median, standard deviation (std), min, max, and quartiles (25th percentile, 50th percentile, and 75th percentile) for each year, are important for understanding the spread and distribution of the data.
- Mean (Average): Indicates the central tendency of the data. A higher mean in "Test Veh Displacement (L)" suggests that, on average, vehicle engines had larger displacements in certain years.
- Standard Deviation (Std): Reflects the variability in the data. A higher standard deviation means the data points are more spread out from the mean.
- Min/Max: These values help detect potential outliers. For example, a "Test Veh Displacement" value of 99.999L is likely an outlier, as it"s an extreme value compared to typical vehicle engine sizes.
- Quartiles (25th, 50th, 75th Percentile): These provide insights into the spread of the data. The 25th percentile (Q1) and 75th percentile (Q3) give a range where the middle 50% of the data lie.
For instance, in 2020, the 25th percentile for "Test Veh Displacement (L)" is 2.000, while the 75th percentile is 3.500, showing that most vehicles in that year had a displacement between these values.
Detecting Outliers and Anomalies¶
The IQR and z-score methods were used to detect outliers in the continuous variables. The IQR method identifies outliers as values that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The z-score method identifies outliers as values that fall below the mean - 3 * standard deviation or above the mean + 3 * standard deviation.
from matplotlib.axes import Axes
def detect_outliers_iqr(df: pd.DataFrame, column: str) -> pd.DataFrame:
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
return df[(df[column] < lower_bound) | (df[column] > upper_bound)] # type: ignore
def detect_outliers_zscore(df: pd.DataFrame, column: str, threshold=3) -> pd.DataFrame:
mean = np.mean(df[column])
std = np.std(df[column])
z_scores = (df[column] - mean) / std # type: ignore
return df[np.abs(z_scores) > threshold]
# Loop through each dataset and apply outlier detection
for i, df in enumerate(dfs):
year = df["Year"].iloc[0]
print(f"Outliers for year {year}:")
# IQR method
displacement_outliers_iqr = detect_outliers_iqr(df, "Test Veh Displacement (L)")
coef_a_outliers_iqr = detect_outliers_iqr(df, "Set Coef A (lbf)")
print(f"Test Veh Displacement (IQR method): {len(displacement_outliers_iqr)} outliers")
print(f"Set Coef A (IQR method): {len(coef_a_outliers_iqr)} outliers")
# Z-score method
displacement_outliers_z = detect_outliers_zscore(df, "Test Veh Displacement (L)")
coef_a_outliers_z = detect_outliers_zscore(df, "Set Coef A (lbf)")
print(f"Test Veh Displacement (Z-score method): {len(displacement_outliers_z)} outliers")
print(f"Set Coef A (Z-score method): {len(coef_a_outliers_z)} outliers\n")
# Plot the boxplot
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
ax: tuple[Axes, Axes]
sns.boxplot(x=df["Test Veh Displacement (L)"], ax=ax[0])
ax[0].set_title(f"Test Veh Displacement (Year {year})")
sns.boxplot(x=df["Set Coef A (lbf)"], ax=ax[1])
ax[1].set_title(f"Set Coef A (Year {year})")
plt.show()
Outliers for year 2015: Test Veh Displacement (IQR method): 84 outliers Set Coef A (IQR method): 95 outliers Test Veh Displacement (Z-score method): 12 outliers Set Coef A (Z-score method): 51 outliers
Outliers for year 2016: Test Veh Displacement (IQR method): 135 outliers Set Coef A (IQR method): 60 outliers Test Veh Displacement (Z-score method): 12 outliers Set Coef A (Z-score method): 29 outliers
Outliers for year 2017: Test Veh Displacement (IQR method): 194 outliers Set Coef A (IQR method): 105 outliers Test Veh Displacement (Z-score method): 10 outliers Set Coef A (Z-score method): 33 outliers
Outliers for year 2018: Test Veh Displacement (IQR method): 208 outliers Set Coef A (IQR method): 119 outliers Test Veh Displacement (Z-score method): 18 outliers Set Coef A (Z-score method): 40 outliers
Outliers for year 2019: Test Veh Displacement (IQR method): 188 outliers Set Coef A (IQR method): 186 outliers Test Veh Displacement (Z-score method): 22 outliers Set Coef A (Z-score method): 57 outliers
Outliers for year 2020: Test Veh Displacement (IQR method): 196 outliers Set Coef A (IQR method): 170 outliers Test Veh Displacement (Z-score method): 18 outliers Set Coef A (Z-score method): 51 outliers
Outliers for year 2021: Test Veh Displacement (IQR method): 212 outliers Set Coef A (IQR method): 243 outliers Test Veh Displacement (Z-score method): 16 outliers Set Coef A (Z-score method): 51 outliers
Outliers for year 2022: Test Veh Displacement (IQR method): 228 outliers Set Coef A (IQR method): 121 outliers Test Veh Displacement (Z-score method): 20 outliers Set Coef A (Z-score method): 48 outliers
Outliers for year 2023: Test Veh Displacement (IQR method): 244 outliers Set Coef A (IQR method): 97 outliers Test Veh Displacement (Z-score method): 54 outliers Set Coef A (Z-score method): 43 outliers
Outliers for year 2024: Test Veh Displacement (IQR method): 223 outliers Set Coef A (IQR method): 112 outliers Test Veh Displacement (Z-score method): 59 outliers Set Coef A (Z-score method): 39 outliers
Detailed Interpretation of Outliers Detected in "Test Veh Displacement (L)" and "Set Coef A (lbf)"¶
The outlier detection was performed using two methods: the Interquartile Range (IQR) method and the Z-score method. Both methods are useful for identifying potential anomalies, but they differ in how they define an outlier. Below, we analyze the results from both methods for each variable across different years and interpret the key points.
Outliers Detected in "Test Veh Displacement (L)"¶
IQR Method Results:¶
- Year 2015: 84 outliers
- Year 2016: 135 outliers
- Year 2017: 194 outliers
- Year 2018: 208 outliers
- Year 2019: 188 outliers
- Year 2020: 196 outliers
- Year 2021: 212 outliers
- Year 2022: 228 outliers
- Year 2023: 244 outliers
- Year 2024: 223 outliers
Z-score Method Results:¶
- Year 2015: 12 outliers
- Year 2016: 12 outliers
- Year 2017: 10 outliers
- Year 2018: 18 outliers
- Year 2019: 22 outliers
- Year 2020: 18 outliers
- Year 2021: 16 outliers
- Year 2022: 20 outliers
- Year 2023: 54 outliers
- Year 2024: 59 outliers
Key Observations and Interpretation for "Test Veh Displacement (L)":¶
IQR Method consistently detects a higher number of outliers than the Z-score method. This indicates that the IQR method is more sensitive to detecting data points that fall outside the interquartile range, particularly those that are significantly larger or smaller than the typical values.
Increasing Number of Outliers Over Time (IQR Method):
- There is a clear upward trend in the number of outliers detected using the IQR method, from 84 outliers in 2015 to 244 outliers in 2023. This could indicate greater variability in vehicle engine sizes in more recent years. The increase in large-displacement vehicles, such as trucks and SUVs, may account for this pattern.
- The peak in outliers in 2023 suggests that this year had the highest number of vehicles with unusually large or small displacements, potentially due to changes in vehicle types, new models, or testing methods.
Z-score Method"s Conservative Detection:
- The Z-score method is more conservative, identifying far fewer outliers. This method is stricter, only flagging data points that are several standard deviations away from the mean.
- The 2015-2020 period shows relatively stable and low counts of outliers (between 10 and 22 outliers), but 2023 and 2024 show a sharp increase (with 54 and 59 outliers, respectively). This indicates an increase in vehicles with extremely large or small engine sizes in recent years.
Extreme Values (Max/Min): The presence of extreme values like 99.999 L as the maximum in several years suggests potential data entry errors or highly unusual vehicles (e.g., prototype or experimental models). These need to be verified or removed, as they could distort further analysis.
Outliers Detected in "Set Coef A (lbf)"¶
IQR Method Results:¶
- Year 2015: 95 outliers
- Year 2016: 60 outliers
- Year 2017: 105 outliers
- Year 2018: 119 outliers
- Year 2019: 186 outliers
- Year 2020: 170 outliers
- Year 2021: 243 outliers
- Year 2022: 121 outliers
- Year 2023: 97 outliers
- Year 2024: 112 outliers
Z-score Method Results:¶
- Year 2015: 51 outliers
- Year 2016: 29 outliers
- Year 2017: 33 outliers
- Year 2018: 40 outliers
- Year 2019: 57 outliers
- Year 2020: 51 outliers
- Year 2021: 51 outliers
- Year 2022: 48 outliers
- Year 2023: 43 outliers
- Year 2024: 39 outliers
Key Observations and Interpretation for "Set Coef A (lbf)":¶
Fluctuation in IQR Outliers:
- The IQR method detects a fluctuating number of outliers, with significant peaks in 2019 (186 outliers) and 2021 (243 outliers).
- These fluctuations suggest that certain years had more vehicles with unusual aerodynamic properties (i.e., extreme coefficients). This could be due to new vehicle designs or aerodynamic testing methods.
Consistent Detection by Z-score:
- The Z-score method consistently detects fewer outliers compared to the IQR method, identifying around 29 to 57 outliers per year. This suggests that while the data has variability, only a limited number of vehicles have extreme coefficient values far from the mean.
High Outliers in 2019 and 2021 (IQR Method):
- The peak in 2019 (186 outliers) and 2021 (243 outliers) using the IQR method could be indicative of certain vehicle models that deviated significantly from the norm in terms of aerodynamic properties (high drag coefficients).
- This may reflect changes in vehicle design or testing standards during these years, or the introduction of larger, less aerodynamic vehicles into the dataset.
Potential Data Entry Errors: The presence of negative values in "Set Coef A (lbf)" (e.g., -52.8 lbf in some years) might indicate incorrect data or measurement anomalies, as coefficient A represents a drag coefficient, which is typically a positive value. These negative values should be flagged and reviewed for correctness.
2. Data Structure and Missing Values ¶
# Check data structure for each dataset
for df in dfs:
year = df["Year"].iloc[0]
print(f"Data Structure for Year {year}:")
print(df.info())
print("\n")
# Check for missing values in each dataset
for df in dfs:
year = df["Year"].iloc[0]
print(f"Missing Values for Year {year}:")
missing = df.isnull().sum()
print(missing[missing > 0])
print("\n")
Data Structure for Year 2015: <class 'pandas.core.frame.DataFrame'> RangeIndex: 4832 entries, 0 to 4831 Data columns (total 68 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Model Year 4832 non-null int64 1 Vehicle Manufacturer Name 4832 non-null object 2 Veh Mfr Code 4832 non-null object 3 Represented Test Veh Make 4832 non-null object 4 Represented Test Veh Model 4832 non-null object 5 Test Vehicle ID 4832 non-null object 6 Test Veh Configuration # 4832 non-null int64 7 Test Veh Displacement (L) 4832 non-null float64 8 Actual Tested Testgroup 4832 non-null object 9 Vehicle Type 4832 non-null object 10 Rated Horsepower 4832 non-null int64 11 # of Cylinders and Rotors 4666 non-null float64 12 Engine Code 4832 non-null object 13 Tested Transmission Type Code 4832 non-null object 14 Tested Transmission Type 4832 non-null object 15 # of Gears 4832 non-null int64 16 Transmission Lockup? 4832 non-null object 17 Drive System Code 4832 non-null object 18 Drive System Description 4832 non-null object 19 Transmission Overdrive Code 4832 non-null int64 20 Transmission Overdrive Desc 4832 non-null object 21 Equivalent Test Weight (lbs.) 4832 non-null int64 22 Axle Ratio 4832 non-null float64 23 N/V Ratio 4832 non-null float64 24 Shift Indicator Light Use Cd 4832 non-null int64 25 Shift Indicator Light Use Desc 4832 non-null object 26 Test Number 4832 non-null object 27 Test Originator 4832 non-null object 28 Analytically Derived FE? 4832 non-null object 29 ADFE Test Number 385 non-null object 30 ADFE Total Road Load HP 385 non-null float64 31 ADFE Equiv. Test Weight (lbs.) 387 non-null float64 32 ADFE N/V Ratio 385 non-null float64 33 Test Procedure Cd 4832 non-null int64 34 Test Procedure Description 4832 non-null object 35 Test Fuel Type Cd 4832 non-null int64 36 Test Fuel Type Description 4832 non-null object 37 Test Category 4832 non-null object 38 THC (g/mi) 4519 non-null float64 39 CO (g/mi) 4530 non-null float64 40 CO2 (g/mi) 4742 non-null float64 41 NOx (g/mi) 4493 non-null float64 42 PM (g/mi) 669 non-null float64 43 CH4 (g/mi) 4090 non-null float64 44 N2O (g/mi) 2057 non-null float64 45 RND_ADJ_FE 4830 non-null float64 46 FE_UNIT 4832 non-null object 47 FE Bag 1 2130 non-null float64 48 FE Bag 2 2130 non-null float64 49 FE Bag 3 1892 non-null float64 50 FE Bag 4 90 non-null float64 51 DT-Inertia Work Ratio Rating 1602 non-null float64 52 DT-Absolute Speed Change Ratg 1602 non-null float64 53 DT-Energy Economy Rating 1602 non-null float64 54 Target Coef A (lbf) 4832 non-null float64 55 Target Coef B (lbf/mph) 4832 non-null float64 56 Target Coef C (lbf/mph**2) 4832 non-null float64 57 Set Coef A (lbf) 4832 non-null float64 58 Set Coef B (lbf/mph) 4832 non-null float64 59 Set Coef C (lbf/mph**2) 4832 non-null float64 60 Aftertreatment Device Cd 4651 non-null object 61 Aftertreatment Device Desc 4651 non-null object 62 Police - Emergency Vehicle? 4832 non-null object 63 Averaging Group ID 320 non-null object 64 Averaging Weighting Factor 320 non-null float64 65 Averaging Method Cd 4832 non-null object 66 Averging Method Desc 4832 non-null object 67 Year 4832 non-null object dtypes: float64(29), int64(9), object(30) memory usage: 2.5+ MB None Data Structure for Year 2016: <class 'pandas.core.frame.DataFrame'> RangeIndex: 4650 entries, 0 to 4649 Data columns (total 68 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Model Year 4650 non-null int64 1 Vehicle Manufacturer Name 4650 non-null object 2 Veh Mfr Code 4650 non-null object 3 Represented Test Veh Make 4650 non-null object 4 Represented Test Veh Model 4650 non-null object 5 Test Vehicle ID 4650 non-null object 6 Test Veh Configuration # 4650 non-null int64 7 Test Veh Displacement (L) 4650 non-null float64 8 Actual Tested Testgroup 4650 non-null object 9 Vehicle Type 4650 non-null object 10 Rated Horsepower 4650 non-null int64 11 # of Cylinders and Rotors 4510 non-null float64 12 Engine Code 4650 non-null object 13 Tested Transmission Type Code 4650 non-null object 14 Tested Transmission Type 4650 non-null object 15 # of Gears 4650 non-null int64 16 Transmission Lockup? 4650 non-null object 17 Drive System Code 4650 non-null object 18 Drive System Description 4650 non-null object 19 Transmission Overdrive Code 4650 non-null int64 20 Transmission Overdrive Desc 4650 non-null object 21 Equivalent Test Weight (lbs.) 4650 non-null int64 22 Axle Ratio 4650 non-null float64 23 N/V Ratio 4650 non-null float64 24 Shift Indicator Light Use Cd 4650 non-null int64 25 Shift Indicator Light Use Desc 4650 non-null object 26 Test Number 4650 non-null object 27 Test Originator 4650 non-null object 28 Analytically Derived FE? 4650 non-null object 29 ADFE Test Number 466 non-null object 30 ADFE Total Road Load HP 466 non-null float64 31 ADFE Equiv. Test Weight (lbs.) 468 non-null float64 32 ADFE N/V Ratio 466 non-null float64 33 Test Procedure Cd 4650 non-null int64 34 Test Procedure Description 4650 non-null object 35 Test Fuel Type Cd 4650 non-null int64 36 Test Fuel Type Description 4650 non-null object 37 Test Category 4650 non-null object 38 THC (g/mi) 4213 non-null float64 39 CO (g/mi) 4214 non-null float64 40 CO2 (g/mi) 4508 non-null float64 41 NOx (g/mi) 4175 non-null float64 42 PM (g/mi) 653 non-null float64 43 CH4 (g/mi) 3856 non-null float64 44 N2O (g/mi) 1727 non-null float64 45 RND_ADJ_FE 4646 non-null float64 46 FE_UNIT 4650 non-null object 47 FE Bag 1 2112 non-null float64 48 FE Bag 2 2112 non-null float64 49 FE Bag 3 1863 non-null float64 50 FE Bag 4 117 non-null float64 51 DT-Inertia Work Ratio Rating 2519 non-null float64 52 DT-Absolute Speed Change Ratg 2519 non-null float64 53 DT-Energy Economy Rating 2519 non-null float64 54 Target Coef A (lbf) 4650 non-null float64 55 Target Coef B (lbf/mph) 4650 non-null float64 56 Target Coef C (lbf/mph**2) 4650 non-null float64 57 Set Coef A (lbf) 4650 non-null float64 58 Set Coef B (lbf/mph) 4650 non-null float64 59 Set Coef C (lbf/mph**2) 4650 non-null float64 60 Aftertreatment Device Cd 4499 non-null object 61 Aftertreatment Device Desc 4499 non-null object 62 Police - Emergency Vehicle? 4650 non-null object 63 Averaging Group ID 357 non-null object 64 Averaging Weighting Factor 357 non-null float64 65 Averaging Method Cd 4650 non-null object 66 Averging Method Desc 4650 non-null object 67 Year 4650 non-null object dtypes: float64(29), int64(9), object(30) memory usage: 2.4+ MB None Data Structure for Year 2017: <class 'pandas.core.frame.DataFrame'> RangeIndex: 4685 entries, 0 to 4684 Data columns (total 68 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Model Year 4685 non-null int64 1 Vehicle Manufacturer Name 4685 non-null object 2 Veh Mfr Code 4685 non-null object 3 Represented Test Veh Make 4685 non-null object 4 Represented Test Veh Model 4685 non-null object 5 Test Vehicle ID 4685 non-null object 6 Test Veh Configuration # 4685 non-null int64 7 Test Veh Displacement (L) 4685 non-null float64 8 Actual Tested Testgroup 4685 non-null object 9 Vehicle Type 4685 non-null object 10 Rated Horsepower 4685 non-null int64 11 # of Cylinders and Rotors 4570 non-null float64 12 Engine Code 4685 non-null object 13 Tested Transmission Type Code 4685 non-null object 14 Tested Transmission Type 4685 non-null object 15 # of Gears 4685 non-null int64 16 Transmission Lockup? 4685 non-null object 17 Drive System Code 4685 non-null object 18 Drive System Description 4685 non-null object 19 Transmission Overdrive Code 4685 non-null int64 20 Transmission Overdrive Desc 4685 non-null object 21 Equivalent Test Weight (lbs.) 4685 non-null int64 22 Axle Ratio 4685 non-null float64 23 N/V Ratio 4685 non-null float64 24 Shift Indicator Light Use Cd 4685 non-null int64 25 Shift Indicator Light Use Desc 4685 non-null object 26 Test Number 4685 non-null object 27 Test Originator 4685 non-null object 28 Analytically Derived FE? 4685 non-null object 29 ADFE Test Number 566 non-null object 30 ADFE Total Road Load HP 566 non-null float64 31 ADFE Equiv. Test Weight (lbs.) 566 non-null float64 32 ADFE N/V Ratio 566 non-null float64 33 Test Procedure Cd 4685 non-null int64 34 Test Procedure Description 4685 non-null object 35 Test Fuel Type Cd 4685 non-null int64 36 Test Fuel Type Description 4685 non-null object 37 Test Category 4685 non-null object 38 THC (g/mi) 4165 non-null float64 39 CO (g/mi) 4166 non-null float64 40 CO2 (g/mi) 4526 non-null float64 41 NOx (g/mi) 4112 non-null float64 42 PM (g/mi) 627 non-null float64 43 CH4 (g/mi) 3974 non-null float64 44 N2O (g/mi) 1855 non-null float64 45 RND_ADJ_FE 4683 non-null float64 46 FE_UNIT 4685 non-null object 47 FE Bag 1 2165 non-null float64 48 FE Bag 2 2165 non-null float64 49 FE Bag 3 1889 non-null float64 50 FE Bag 4 122 non-null float64 51 DT-Inertia Work Ratio Rating 3634 non-null float64 52 DT-Absolute Speed Change Ratg 3634 non-null float64 53 DT-Energy Economy Rating 3634 non-null float64 54 Target Coef A (lbf) 4685 non-null float64 55 Target Coef B (lbf/mph) 4685 non-null float64 56 Target Coef C (lbf/mph**2) 4685 non-null float64 57 Set Coef A (lbf) 4685 non-null float64 58 Set Coef B (lbf/mph) 4685 non-null float64 59 Set Coef C (lbf/mph**2) 4685 non-null float64 60 Aftertreatment Device Cd 4559 non-null object 61 Aftertreatment Device Desc 4559 non-null object 62 Police - Emergency Vehicle? 4685 non-null object 63 Averaging Group ID 348 non-null object 64 Averaging Weighting Factor 348 non-null float64 65 Averaging Method Cd 4685 non-null object 66 Averging Method Desc 4685 non-null object 67 Year 4685 non-null object dtypes: float64(29), int64(9), object(30) memory usage: 2.4+ MB None Data Structure for Year 2018: <class 'pandas.core.frame.DataFrame'> RangeIndex: 4727 entries, 0 to 4726 Data columns (total 68 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Model Year 4727 non-null int64 1 Vehicle Manufacturer Name 4727 non-null object 2 Veh Mfr Code 4727 non-null object 3 Represented Test Veh Make 4727 non-null object 4 Represented Test Veh Model 4727 non-null object 5 Test Vehicle ID 4727 non-null object 6 Test Veh Configuration # 4727 non-null int64 7 Test Veh Displacement (L) 4727 non-null float64 8 Actual Tested Testgroup 4727 non-null object 9 Vehicle Type 4727 non-null object 10 Rated Horsepower 4727 non-null int64 11 # of Cylinders and Rotors 4626 non-null float64 12 Engine Code 4727 non-null object 13 Tested Transmission Type Code 4727 non-null object 14 Tested Transmission Type 4727 non-null object 15 # of Gears 4727 non-null int64 16 Transmission Lockup? 4727 non-null object 17 Drive System Code 4727 non-null object 18 Drive System Description 4727 non-null object 19 Transmission Overdrive Code 4727 non-null int64 20 Transmission Overdrive Desc 4727 non-null object 21 Equivalent Test Weight (lbs.) 4727 non-null int64 22 Axle Ratio 4727 non-null float64 23 N/V Ratio 4727 non-null float64 24 Shift Indicator Light Use Cd 4727 non-null int64 25 Shift Indicator Light Use Desc 4727 non-null object 26 Test Number 4727 non-null object 27 Test Originator 4727 non-null object 28 Analytically Derived FE? 4727 non-null object 29 ADFE Test Number 540 non-null object 30 ADFE Total Road Load HP 540 non-null float64 31 ADFE Equiv. Test Weight (lbs.) 540 non-null float64 32 ADFE N/V Ratio 540 non-null float64 33 Test Procedure Cd 4727 non-null int64 34 Test Procedure Description 4727 non-null object 35 Test Fuel Type Cd 4727 non-null int64 36 Test Fuel Type Description 4727 non-null object 37 Test Category 4727 non-null object 38 THC (g/mi) 4252 non-null float64 39 CO (g/mi) 4254 non-null float64 40 CO2 (g/mi) 4568 non-null float64 41 NOx (g/mi) 4203 non-null float64 42 PM (g/mi) 904 non-null float64 43 CH4 (g/mi) 4045 non-null float64 44 N2O (g/mi) 2206 non-null float64 45 RND_ADJ_FE 4716 non-null float64 46 FE_UNIT 4727 non-null object 47 FE Bag 1 2245 non-null float64 48 FE Bag 2 2245 non-null float64 49 FE Bag 3 1949 non-null float64 50 FE Bag 4 120 non-null float64 51 DT-Inertia Work Ratio Rating 4092 non-null float64 52 DT-Absolute Speed Change Ratg 4092 non-null float64 53 DT-Energy Economy Rating 4092 non-null float64 54 Target Coef A (lbf) 4727 non-null float64 55 Target Coef B (lbf/mph) 4727 non-null float64 56 Target Coef C (lbf/mph**2) 4727 non-null float64 57 Set Coef A (lbf) 4727 non-null float64 58 Set Coef B (lbf/mph) 4727 non-null float64 59 Set Coef C (lbf/mph**2) 4727 non-null float64 60 Aftertreatment Device Cd 4613 non-null object 61 Aftertreatment Device Desc 4613 non-null object 62 Police - Emergency Vehicle? 4727 non-null object 63 Averaging Group ID 223 non-null object 64 Averaging Weighting Factor 223 non-null float64 65 Averaging Method Cd 4727 non-null object 66 Averging Method Desc 4727 non-null object 67 Year 4727 non-null object dtypes: float64(29), int64(9), object(30) memory usage: 2.5+ MB None Data Structure for Year 2019: <class 'pandas.core.frame.DataFrame'> RangeIndex: 4719 entries, 0 to 4718 Data columns (total 68 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Model Year 4719 non-null int64 1 Vehicle Manufacturer Name 4719 non-null object 2 Veh Mfr Code 4719 non-null object 3 Represented Test Veh Make 4719 non-null object 4 Represented Test Veh Model 4719 non-null object 5 Test Vehicle ID 4719 non-null object 6 Test Veh Configuration # 4719 non-null int64 7 Test Veh Displacement (L) 4719 non-null float64 8 Actual Tested Testgroup 4719 non-null object 9 Vehicle Type 4719 non-null object 10 Rated Horsepower 4719 non-null int64 11 # of Cylinders and Rotors 4564 non-null float64 12 Engine Code 4719 non-null object 13 Tested Transmission Type Code 4719 non-null object 14 Tested Transmission Type 4719 non-null object 15 # of Gears 4719 non-null int64 16 Transmission Lockup? 4719 non-null object 17 Drive System Code 4719 non-null object 18 Drive System Description 4719 non-null object 19 Transmission Overdrive Code 4719 non-null int64 20 Transmission Overdrive Desc 4719 non-null object 21 Equivalent Test Weight (lbs.) 4719 non-null int64 22 Axle Ratio 4719 non-null float64 23 N/V Ratio 4719 non-null float64 24 Shift Indicator Light Use Cd 4719 non-null int64 25 Shift Indicator Light Use Desc 4719 non-null object 26 Test Number 4719 non-null object 27 Test Originator 4719 non-null object 28 Analytically Derived FE? 4719 non-null object 29 ADFE Test Number 418 non-null object 30 ADFE Total Road Load HP 418 non-null float64 31 ADFE Equiv. Test Weight (lbs.) 418 non-null float64 32 ADFE N/V Ratio 418 non-null float64 33 Test Procedure Cd 4719 non-null int64 34 Test Procedure Description 4719 non-null object 35 Test Fuel Type Cd 4719 non-null int64 36 Test Fuel Type Description 4719 non-null object 37 Test Category 4719 non-null object 38 THC (g/mi) 4150 non-null float64 39 CO (g/mi) 4159 non-null float64 40 CO2 (g/mi) 4497 non-null float64 41 NOx (g/mi) 4107 non-null float64 42 PM (g/mi) 776 non-null float64 43 CH4 (g/mi) 3924 non-null float64 44 N2O (g/mi) 2329 non-null float64 45 RND_ADJ_FE 4689 non-null float64 46 FE_UNIT 4719 non-null object 47 FE Bag 1 2242 non-null float64 48 FE Bag 2 2242 non-null float64 49 FE Bag 3 1902 non-null float64 50 FE Bag 4 136 non-null float64 51 DT-Inertia Work Ratio Rating 4369 non-null float64 52 DT-Absolute Speed Change Ratg 4369 non-null float64 53 DT-Energy Economy Rating 4369 non-null float64 54 Target Coef A (lbf) 4719 non-null float64 55 Target Coef B (lbf/mph) 4719 non-null float64 56 Target Coef C (lbf/mph**2) 4719 non-null float64 57 Set Coef A (lbf) 4719 non-null float64 58 Set Coef B (lbf/mph) 4719 non-null float64 59 Set Coef C (lbf/mph**2) 4719 non-null float64 60 Aftertreatment Device Cd 4551 non-null object 61 Aftertreatment Device Desc 4551 non-null object 62 Police - Emergency Vehicle? 4719 non-null object 63 Averaging Group ID 209 non-null object 64 Averaging Weighting Factor 209 non-null float64 65 Averaging Method Cd 4719 non-null object 66 Averging Method Desc 4719 non-null object 67 Year 4719 non-null object dtypes: float64(29), int64(9), object(30) memory usage: 2.4+ MB None Data Structure for Year 2020: <class 'pandas.core.frame.DataFrame'> RangeIndex: 4450 entries, 0 to 4449 Data columns (total 68 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Model Year 4450 non-null int64 1 Vehicle Manufacturer Name 4450 non-null object 2 Veh Mfr Code 4450 non-null object 3 Represented Test Veh Make 4450 non-null object 4 Represented Test Veh Model 4450 non-null object 5 Test Vehicle ID 4450 non-null object 6 Test Veh Configuration # 4450 non-null int64 7 Test Veh Displacement (L) 4450 non-null float64 8 Actual Tested Testgroup 4450 non-null object 9 Vehicle Type 4450 non-null object 10 Rated Horsepower 4450 non-null int64 11 # of Cylinders and Rotors 4278 non-null float64 12 Engine Code 4450 non-null object 13 Tested Transmission Type Code 4450 non-null object 14 Tested Transmission Type 4450 non-null object 15 # of Gears 4450 non-null int64 16 Transmission Lockup? 4450 non-null object 17 Drive System Code 4450 non-null object 18 Drive System Description 4450 non-null object 19 Transmission Overdrive Code 4450 non-null int64 20 Transmission Overdrive Desc 4450 non-null object 21 Equivalent Test Weight (lbs.) 4450 non-null int64 22 Axle Ratio 4450 non-null float64 23 N/V Ratio 4450 non-null float64 24 Shift Indicator Light Use Cd 4450 non-null int64 25 Shift Indicator Light Use Desc 4450 non-null object 26 Test Number 4450 non-null object 27 Test Originator 4450 non-null object 28 Analytically Derived FE? 4450 non-null object 29 ADFE Test Number 388 non-null object 30 ADFE Total Road Load HP 388 non-null float64 31 ADFE Equiv. Test Weight (lbs.) 388 non-null float64 32 ADFE N/V Ratio 388 non-null float64 33 Test Procedure Cd 4450 non-null int64 34 Test Procedure Description 4450 non-null object 35 Test Fuel Type Cd 4450 non-null int64 36 Test Fuel Type Description 4450 non-null object 37 Test Category 4450 non-null object 38 THC (g/mi) 3935 non-null float64 39 CO (g/mi) 3947 non-null float64 40 CO2 (g/mi) 4248 non-null float64 41 NOx (g/mi) 3890 non-null float64 42 PM (g/mi) 754 non-null float64 43 CH4 (g/mi) 3692 non-null float64 44 N2O (g/mi) 2287 non-null float64 45 RND_ADJ_FE 4408 non-null float64 46 FE_UNIT 4450 non-null object 47 FE Bag 1 2159 non-null float64 48 FE Bag 2 2159 non-null float64 49 FE Bag 3 1790 non-null float64 50 FE Bag 4 143 non-null float64 51 DT-Inertia Work Ratio Rating 4246 non-null float64 52 DT-Absolute Speed Change Ratg 4246 non-null float64 53 DT-Energy Economy Rating 4246 non-null float64 54 Target Coef A (lbf) 4450 non-null float64 55 Target Coef B (lbf/mph) 4450 non-null float64 56 Target Coef C (lbf/mph**2) 4450 non-null float64 57 Set Coef A (lbf) 4450 non-null float64 58 Set Coef B (lbf/mph) 4450 non-null float64 59 Set Coef C (lbf/mph**2) 4450 non-null float64 60 Aftertreatment Device Cd 4273 non-null object 61 Aftertreatment Device Desc 4273 non-null object 62 Police - Emergency Vehicle? 4450 non-null object 63 Averaging Group ID 94 non-null object 64 Averaging Weighting Factor 94 non-null float64 65 Averaging Method Cd 4450 non-null object 66 Averging Method Desc 4450 non-null object 67 Year 4450 non-null object dtypes: float64(29), int64(9), object(30) memory usage: 2.3+ MB None Data Structure for Year 2021: <class 'pandas.core.frame.DataFrame'> RangeIndex: 4265 entries, 0 to 4264 Data columns (total 68 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Model Year 4265 non-null int64 1 Vehicle Manufacturer Name 4265 non-null object 2 Veh Mfr Code 4265 non-null object 3 Represented Test Veh Make 4265 non-null object 4 Represented Test Veh Model 4265 non-null object 5 Test Vehicle ID 4265 non-null object 6 Test Veh Configuration # 4265 non-null int64 7 Test Veh Displacement (L) 4265 non-null float64 8 Actual Tested Testgroup 4265 non-null object 9 Vehicle Type 4265 non-null object 10 Rated Horsepower 4265 non-null int64 11 # of Cylinders and Rotors 4055 non-null float64 12 Engine Code 4265 non-null object 13 Tested Transmission Type Code 4265 non-null object 14 Tested Transmission Type 4265 non-null object 15 # of Gears 4265 non-null int64 16 Transmission Lockup? 4265 non-null object 17 Drive System Code 4265 non-null object 18 Drive System Description 4265 non-null object 19 Transmission Overdrive Code 4265 non-null int64 20 Transmission Overdrive Desc 4265 non-null object 21 Equivalent Test Weight (lbs.) 4265 non-null int64 22 Axle Ratio 4265 non-null float64 23 N/V Ratio 4265 non-null float64 24 Shift Indicator Light Use Cd 4265 non-null int64 25 Shift Indicator Light Use Desc 4265 non-null object 26 Test Number 4265 non-null object 27 Test Originator 4265 non-null object 28 Analytically Derived FE? 4265 non-null object 29 ADFE Test Number 446 non-null object 30 ADFE Total Road Load HP 446 non-null float64 31 ADFE Equiv. Test Weight (lbs.) 446 non-null float64 32 ADFE N/V Ratio 446 non-null float64 33 Test Procedure Cd 4265 non-null int64 34 Test Procedure Description 4265 non-null object 35 Test Fuel Type Cd 4265 non-null int64 36 Test Fuel Type Description 4265 non-null object 37 Test Category 4265 non-null object 38 THC (g/mi) 3634 non-null float64 39 CO (g/mi) 3652 non-null float64 40 CO2 (g/mi) 3987 non-null float64 41 NOx (g/mi) 3610 non-null float64 42 PM (g/mi) 779 non-null float64 43 CH4 (g/mi) 3359 non-null float64 44 N2O (g/mi) 2233 non-null float64 45 RND_ADJ_FE 4162 non-null float64 46 FE_UNIT 4265 non-null object 47 FE Bag 1 2030 non-null float64 48 FE Bag 2 2030 non-null float64 49 FE Bag 3 1657 non-null float64 50 FE Bag 4 219 non-null float64 51 DT-Inertia Work Ratio Rating 4010 non-null float64 52 DT-Absolute Speed Change Ratg 4010 non-null float64 53 DT-Energy Economy Rating 4010 non-null float64 54 Target Coef A (lbf) 4265 non-null float64 55 Target Coef B (lbf/mph) 4265 non-null float64 56 Target Coef C (lbf/mph**2) 4265 non-null float64 57 Set Coef A (lbf) 4265 non-null float64 58 Set Coef B (lbf/mph) 4265 non-null float64 59 Set Coef C (lbf/mph**2) 4265 non-null float64 60 Aftertreatment Device Cd 4055 non-null object 61 Aftertreatment Device Desc 4055 non-null object 62 Police - Emergency Vehicle? 4265 non-null object 63 Averaging Group ID 78 non-null object 64 Averaging Weighting Factor 78 non-null float64 65 Averaging Method Cd 4265 non-null object 66 Averging Method Desc 4265 non-null object 67 Year 4265 non-null object dtypes: float64(29), int64(9), object(30) memory usage: 2.2+ MB None Data Structure for Year 2022: <class 'pandas.core.frame.DataFrame'> RangeIndex: 4493 entries, 0 to 4492 Data columns (total 68 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Model Year 4493 non-null int64 1 Vehicle Manufacturer Name 4493 non-null object 2 Veh Mfr Code 4493 non-null object 3 Represented Test Veh Make 4493 non-null object 4 Represented Test Veh Model 4493 non-null object 5 Test Vehicle ID 4493 non-null object 6 Test Veh Configuration # 4493 non-null int64 7 Test Veh Displacement (L) 4493 non-null float64 8 Actual Tested Testgroup 4493 non-null object 9 Vehicle Type 4493 non-null object 10 Rated Horsepower 4493 non-null int64 11 # of Cylinders and Rotors 4205 non-null float64 12 Engine Code 4493 non-null object 13 Tested Transmission Type Code 4493 non-null object 14 Tested Transmission Type 4493 non-null object 15 # of Gears 4493 non-null int64 16 Transmission Lockup? 4493 non-null object 17 Drive System Code 4493 non-null object 18 Drive System Description 4493 non-null object 19 Transmission Overdrive Code 4493 non-null int64 20 Transmission Overdrive Desc 4493 non-null object 21 Equivalent Test Weight (lbs.) 4493 non-null int64 22 Axle Ratio 4493 non-null float64 23 N/V Ratio 4493 non-null float64 24 Shift Indicator Light Use Cd 4493 non-null int64 25 Shift Indicator Light Use Desc 4493 non-null object 26 Test Number 4493 non-null object 27 Test Originator 4493 non-null object 28 Analytically Derived FE? 4493 non-null object 29 ADFE Test Number 668 non-null object 30 ADFE Total Road Load HP 668 non-null float64 31 ADFE Equiv. Test Weight (lbs.) 668 non-null float64 32 ADFE N/V Ratio 668 non-null float64 33 Test Procedure Cd 4493 non-null int64 34 Test Procedure Description 4493 non-null object 35 Test Fuel Type Cd 4493 non-null int64 36 Test Fuel Type Description 4493 non-null object 37 Test Category 4493 non-null object 38 THC (g/mi) 3666 non-null float64 39 CO (g/mi) 3682 non-null float64 40 CO2 (g/mi) 4098 non-null float64 41 NOx (g/mi) 3668 non-null float64 42 PM (g/mi) 728 non-null float64 43 CH4 (g/mi) 3385 non-null float64 44 N2O (g/mi) 2319 non-null float64 45 RND_ADJ_FE 4368 non-null float64 46 FE_UNIT 4493 non-null object 47 FE Bag 1 2098 non-null float64 48 FE Bag 2 2098 non-null float64 49 FE Bag 3 1671 non-null float64 50 FE Bag 4 234 non-null float64 51 DT-Inertia Work Ratio Rating 4123 non-null float64 52 DT-Absolute Speed Change Ratg 4123 non-null float64 53 DT-Energy Economy Rating 4123 non-null float64 54 Target Coef A (lbf) 4493 non-null float64 55 Target Coef B (lbf/mph) 4493 non-null float64 56 Target Coef C (lbf/mph**2) 4493 non-null float64 57 Set Coef A (lbf) 4493 non-null float64 58 Set Coef B (lbf/mph) 4493 non-null float64 59 Set Coef C (lbf/mph**2) 4493 non-null float64 60 Aftertreatment Device Cd 4205 non-null object 61 Aftertreatment Device Desc 4205 non-null object 62 Police - Emergency Vehicle? 4493 non-null object 63 Averaging Group ID 62 non-null object 64 Averaging Weighting Factor 62 non-null float64 65 Averaging Method Cd 4493 non-null object 66 Averging Method Desc 4493 non-null object 67 Year 4493 non-null object dtypes: float64(29), int64(9), object(30) memory usage: 2.3+ MB None Data Structure for Year 2023: <class 'pandas.core.frame.DataFrame'> RangeIndex: 4521 entries, 0 to 4520 Data columns (total 68 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Model Year 4521 non-null int64 1 Vehicle Manufacturer Name 4521 non-null object 2 Veh Mfr Code 4521 non-null object 3 Represented Test Veh Make 4521 non-null object 4 Represented Test Veh Model 4521 non-null object 5 Test Vehicle ID 4521 non-null object 6 Test Veh Configuration # 4521 non-null int64 7 Test Veh Displacement (L) 4521 non-null float64 8 Actual Tested Testgroup 4521 non-null object 9 Vehicle Type 4521 non-null object 10 Rated Horsepower 4521 non-null int64 11 # of Cylinders and Rotors 4065 non-null float64 12 Engine Code 4510 non-null object 13 Tested Transmission Type Code 4521 non-null object 14 Tested Transmission Type 4521 non-null object 15 # of Gears 4521 non-null int64 16 Transmission Lockup? 4521 non-null object 17 Drive System Code 4521 non-null object 18 Drive System Description 4521 non-null object 19 Transmission Overdrive Code 4521 non-null int64 20 Transmission Overdrive Desc 4521 non-null object 21 Equivalent Test Weight (lbs.) 4521 non-null int64 22 Axle Ratio 4521 non-null float64 23 N/V Ratio 4521 non-null float64 24 Shift Indicator Light Use Cd 4521 non-null int64 25 Shift Indicator Light Use Desc 4521 non-null object 26 Test Number 4521 non-null object 27 Test Originator 4521 non-null object 28 Analytically Derived FE? 4521 non-null object 29 ADFE Test Number 551 non-null object 30 ADFE Total Road Load HP 551 non-null float64 31 ADFE Equiv. Test Weight (lbs.) 551 non-null float64 32 ADFE N/V Ratio 551 non-null float64 33 Test Procedure Cd 4521 non-null int64 34 Test Procedure Description 4521 non-null object 35 Test Fuel Type Cd 4521 non-null int64 36 Test Fuel Type Description 4521 non-null object 37 Test Category 4521 non-null object 38 THC (g/mi) 3659 non-null float64 39 CO (g/mi) 3676 non-null float64 40 CO2 (g/mi) 3939 non-null float64 41 NOx (g/mi) 3667 non-null float64 42 PM (g/mi) 818 non-null float64 43 CH4 (g/mi) 3377 non-null float64 44 N2O (g/mi) 2399 non-null float64 45 RND_ADJ_FE 4385 non-null float64 46 FE_UNIT 4521 non-null object 47 FE Bag 1 2075 non-null float64 48 FE Bag 2 2075 non-null float64 49 FE Bag 3 1648 non-null float64 50 FE Bag 4 276 non-null float64 51 DT-Inertia Work Ratio Rating 4001 non-null float64 52 DT-Absolute Speed Change Ratg 4001 non-null float64 53 DT-Energy Economy Rating 4001 non-null float64 54 Target Coef A (lbf) 4521 non-null float64 55 Target Coef B (lbf/mph) 4521 non-null float64 56 Target Coef C (lbf/mph**2) 4521 non-null float64 57 Set Coef A (lbf) 4521 non-null float64 58 Set Coef B (lbf/mph) 4521 non-null float64 59 Set Coef C (lbf/mph**2) 4521 non-null float64 60 Aftertreatment Device Cd 4065 non-null object 61 Aftertreatment Device Desc 4065 non-null object 62 Police - Emergency Vehicle? 4521 non-null object 63 Averaging Group ID 56 non-null object 64 Averaging Weighting Factor 56 non-null float64 65 Averaging Method Cd 4521 non-null object 66 Averging Method Desc 4521 non-null object 67 Year 4521 non-null object dtypes: float64(29), int64(9), object(30) memory usage: 2.3+ MB None Data Structure for Year 2024: <class 'pandas.core.frame.DataFrame'> RangeIndex: 4137 entries, 0 to 4136 Data columns (total 68 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Model Year 4137 non-null int64 1 Vehicle Manufacturer Name 4137 non-null object 2 Veh Mfr Code 4137 non-null object 3 Represented Test Veh Make 4137 non-null object 4 Represented Test Veh Model 4137 non-null object 5 Test Vehicle ID 4137 non-null object 6 Test Veh Configuration # 4137 non-null int64 7 Test Veh Displacement (L) 4137 non-null float64 8 Actual Tested Testgroup 4137 non-null object 9 Vehicle Type 4137 non-null object 10 Rated Horsepower 4137 non-null int64 11 # of Cylinders and Rotors 3564 non-null float64 12 Engine Code 4124 non-null object 13 Tested Transmission Type Code 4137 non-null object 14 Tested Transmission Type 4137 non-null object 15 # of Gears 4137 non-null int64 16 Transmission Lockup? 4137 non-null object 17 Drive System Code 4137 non-null object 18 Drive System Description 4137 non-null object 19 Transmission Overdrive Code 4137 non-null int64 20 Transmission Overdrive Desc 4137 non-null object 21 Equivalent Test Weight (lbs.) 4137 non-null int64 22 Axle Ratio 4137 non-null float64 23 N/V Ratio 4137 non-null float64 24 Shift Indicator Light Use Cd 4137 non-null int64 25 Shift Indicator Light Use Desc 4137 non-null object 26 Test Number 4137 non-null object 27 Test Originator 4137 non-null object 28 Analytically Derived FE? 4137 non-null object 29 ADFE Test Number 305 non-null object 30 ADFE Total Road Load HP 305 non-null float64 31 ADFE Equiv. Test Weight (lbs.) 305 non-null float64 32 ADFE N/V Ratio 305 non-null float64 33 Test Procedure Cd 4137 non-null int64 34 Test Procedure Description 4137 non-null object 35 Test Fuel Type Cd 4137 non-null int64 36 Test Fuel Type Description 4137 non-null object 37 Test Category 4137 non-null object 38 THC (g/mi) 3278 non-null float64 39 CO (g/mi) 3297 non-null float64 40 CO2 (g/mi) 3453 non-null float64 41 NOx (g/mi) 3288 non-null float64 42 PM (g/mi) 539 non-null float64 43 CH4 (g/mi) 2961 non-null float64 44 N2O (g/mi) 2091 non-null float64 45 RND_ADJ_FE 3991 non-null float64 46 FE_UNIT 4137 non-null object 47 FE Bag 1 1813 non-null float64 48 FE Bag 2 1813 non-null float64 49 FE Bag 3 1425 non-null float64 50 FE Bag 4 307 non-null float64 51 DT-Inertia Work Ratio Rating 3494 non-null float64 52 DT-Absolute Speed Change Ratg 3494 non-null float64 53 DT-Energy Economy Rating 3494 non-null float64 54 Target Coef A (lbf) 4137 non-null float64 55 Target Coef B (lbf/mph) 4137 non-null float64 56 Target Coef C (lbf/mph**2) 4137 non-null float64 57 Set Coef A (lbf) 4137 non-null float64 58 Set Coef B (lbf/mph) 4137 non-null float64 59 Set Coef C (lbf/mph**2) 4137 non-null float64 60 Aftertreatment Device Cd 3564 non-null object 61 Aftertreatment Device Desc 3564 non-null object 62 Police - Emergency Vehicle? 4137 non-null object 63 Averaging Group ID 46 non-null object 64 Averaging Weighting Factor 46 non-null float64 65 Averaging Method Cd 4137 non-null object 66 Averging Method Desc 4137 non-null object 67 Year 4137 non-null object dtypes: float64(29), int64(9), object(30) memory usage: 2.1+ MB None Missing Values for Year 2015: # of Cylinders and Rotors 166 ADFE Test Number 4447 ADFE Total Road Load HP 4447 ADFE Equiv. Test Weight (lbs.) 4445 ADFE N/V Ratio 4447 THC (g/mi) 313 CO (g/mi) 302 CO2 (g/mi) 90 NOx (g/mi) 339 PM (g/mi) 4163 CH4 (g/mi) 742 N2O (g/mi) 2775 RND_ADJ_FE 2 FE Bag 1 2702 FE Bag 2 2702 FE Bag 3 2940 FE Bag 4 4742 DT-Inertia Work Ratio Rating 3230 DT-Absolute Speed Change Ratg 3230 DT-Energy Economy Rating 3230 Aftertreatment Device Cd 181 Aftertreatment Device Desc 181 Averaging Group ID 4512 Averaging Weighting Factor 4512 dtype: int64 Missing Values for Year 2016: # of Cylinders and Rotors 140 ADFE Test Number 4184 ADFE Total Road Load HP 4184 ADFE Equiv. Test Weight (lbs.) 4182 ADFE N/V Ratio 4184 THC (g/mi) 437 CO (g/mi) 436 CO2 (g/mi) 142 NOx (g/mi) 475 PM (g/mi) 3997 CH4 (g/mi) 794 N2O (g/mi) 2923 RND_ADJ_FE 4 FE Bag 1 2538 FE Bag 2 2538 FE Bag 3 2787 FE Bag 4 4533 DT-Inertia Work Ratio Rating 2131 DT-Absolute Speed Change Ratg 2131 DT-Energy Economy Rating 2131 Aftertreatment Device Cd 151 Aftertreatment Device Desc 151 Averaging Group ID 4293 Averaging Weighting Factor 4293 dtype: int64 Missing Values for Year 2017: # of Cylinders and Rotors 115 ADFE Test Number 4119 ADFE Total Road Load HP 4119 ADFE Equiv. Test Weight (lbs.) 4119 ADFE N/V Ratio 4119 THC (g/mi) 520 CO (g/mi) 519 CO2 (g/mi) 159 NOx (g/mi) 573 PM (g/mi) 4058 CH4 (g/mi) 711 N2O (g/mi) 2830 RND_ADJ_FE 2 FE Bag 1 2520 FE Bag 2 2520 FE Bag 3 2796 FE Bag 4 4563 DT-Inertia Work Ratio Rating 1051 DT-Absolute Speed Change Ratg 1051 DT-Energy Economy Rating 1051 Aftertreatment Device Cd 126 Aftertreatment Device Desc 126 Averaging Group ID 4337 Averaging Weighting Factor 4337 dtype: int64 Missing Values for Year 2018: # of Cylinders and Rotors 101 ADFE Test Number 4187 ADFE Total Road Load HP 4187 ADFE Equiv. Test Weight (lbs.) 4187 ADFE N/V Ratio 4187 THC (g/mi) 475 CO (g/mi) 473 CO2 (g/mi) 159 NOx (g/mi) 524 PM (g/mi) 3823 CH4 (g/mi) 682 N2O (g/mi) 2521 RND_ADJ_FE 11 FE Bag 1 2482 FE Bag 2 2482 FE Bag 3 2778 FE Bag 4 4607 DT-Inertia Work Ratio Rating 635 DT-Absolute Speed Change Ratg 635 DT-Energy Economy Rating 635 Aftertreatment Device Cd 114 Aftertreatment Device Desc 114 Averaging Group ID 4504 Averaging Weighting Factor 4504 dtype: int64 Missing Values for Year 2019: # of Cylinders and Rotors 155 ADFE Test Number 4301 ADFE Total Road Load HP 4301 ADFE Equiv. Test Weight (lbs.) 4301 ADFE N/V Ratio 4301 THC (g/mi) 569 CO (g/mi) 560 CO2 (g/mi) 222 NOx (g/mi) 612 PM (g/mi) 3943 CH4 (g/mi) 795 N2O (g/mi) 2390 RND_ADJ_FE 30 FE Bag 1 2477 FE Bag 2 2477 FE Bag 3 2817 FE Bag 4 4583 DT-Inertia Work Ratio Rating 350 DT-Absolute Speed Change Ratg 350 DT-Energy Economy Rating 350 Aftertreatment Device Cd 168 Aftertreatment Device Desc 168 Averaging Group ID 4510 Averaging Weighting Factor 4510 dtype: int64 Missing Values for Year 2020: # of Cylinders and Rotors 172 ADFE Test Number 4062 ADFE Total Road Load HP 4062 ADFE Equiv. Test Weight (lbs.) 4062 ADFE N/V Ratio 4062 THC (g/mi) 515 CO (g/mi) 503 CO2 (g/mi) 202 NOx (g/mi) 560 PM (g/mi) 3696 CH4 (g/mi) 758 N2O (g/mi) 2163 RND_ADJ_FE 42 FE Bag 1 2291 FE Bag 2 2291 FE Bag 3 2660 FE Bag 4 4307 DT-Inertia Work Ratio Rating 204 DT-Absolute Speed Change Ratg 204 DT-Energy Economy Rating 204 Aftertreatment Device Cd 177 Aftertreatment Device Desc 177 Averaging Group ID 4356 Averaging Weighting Factor 4356 dtype: int64 Missing Values for Year 2021: # of Cylinders and Rotors 210 ADFE Test Number 3819 ADFE Total Road Load HP 3819 ADFE Equiv. Test Weight (lbs.) 3819 ADFE N/V Ratio 3819 THC (g/mi) 631 CO (g/mi) 613 CO2 (g/mi) 278 NOx (g/mi) 655 PM (g/mi) 3486 CH4 (g/mi) 906 N2O (g/mi) 2032 RND_ADJ_FE 103 FE Bag 1 2235 FE Bag 2 2235 FE Bag 3 2608 FE Bag 4 4046 DT-Inertia Work Ratio Rating 255 DT-Absolute Speed Change Ratg 255 DT-Energy Economy Rating 255 Aftertreatment Device Cd 210 Aftertreatment Device Desc 210 Averaging Group ID 4187 Averaging Weighting Factor 4187 dtype: int64 Missing Values for Year 2022: # of Cylinders and Rotors 288 ADFE Test Number 3825 ADFE Total Road Load HP 3825 ADFE Equiv. Test Weight (lbs.) 3825 ADFE N/V Ratio 3825 THC (g/mi) 827 CO (g/mi) 811 CO2 (g/mi) 395 NOx (g/mi) 825 PM (g/mi) 3765 CH4 (g/mi) 1108 N2O (g/mi) 2174 RND_ADJ_FE 125 FE Bag 1 2395 FE Bag 2 2395 FE Bag 3 2822 FE Bag 4 4259 DT-Inertia Work Ratio Rating 370 DT-Absolute Speed Change Ratg 370 DT-Energy Economy Rating 370 Aftertreatment Device Cd 288 Aftertreatment Device Desc 288 Averaging Group ID 4431 Averaging Weighting Factor 4431 dtype: int64 Missing Values for Year 2023: # of Cylinders and Rotors 456 Engine Code 11 ADFE Test Number 3970 ADFE Total Road Load HP 3970 ADFE Equiv. Test Weight (lbs.) 3970 ADFE N/V Ratio 3970 THC (g/mi) 862 CO (g/mi) 845 CO2 (g/mi) 582 NOx (g/mi) 854 PM (g/mi) 3703 CH4 (g/mi) 1144 N2O (g/mi) 2122 RND_ADJ_FE 136 FE Bag 1 2446 FE Bag 2 2446 FE Bag 3 2873 FE Bag 4 4245 DT-Inertia Work Ratio Rating 520 DT-Absolute Speed Change Ratg 520 DT-Energy Economy Rating 520 Aftertreatment Device Cd 456 Aftertreatment Device Desc 456 Averaging Group ID 4465 Averaging Weighting Factor 4465 dtype: int64 Missing Values for Year 2024: # of Cylinders and Rotors 573 Engine Code 13 ADFE Test Number 3832 ADFE Total Road Load HP 3832 ADFE Equiv. Test Weight (lbs.) 3832 ADFE N/V Ratio 3832 THC (g/mi) 859 CO (g/mi) 840 CO2 (g/mi) 684 NOx (g/mi) 849 PM (g/mi) 3598 CH4 (g/mi) 1176 N2O (g/mi) 2046 RND_ADJ_FE 146 FE Bag 1 2324 FE Bag 2 2324 FE Bag 3 2712 FE Bag 4 3830 DT-Inertia Work Ratio Rating 643 DT-Absolute Speed Change Ratg 643 DT-Energy Economy Rating 643 Aftertreatment Device Cd 573 Aftertreatment Device Desc 573 Averaging Group ID 4091 Averaging Weighting Factor 4091 dtype: int64
Data Structure¶
The datasets for each year (from 2015 to 2024) contain a similar structure with approximately 68 columns. These columns represent various vehicle-related metrics such as engine displacement, coefficients for aerodynamic drag, emissions data, and other vehicle performance parameters.
- Common Columns: Some common columns across the years include:
- Vehicle Manufacturer Name
- Test Veh Displacement (L)
- Set Coef A (lbf)
- Emissions-related columns: CO2, NOx, PM, THC, CO
- Fuel efficiency and road load data: FE Bag 1 to FE Bag 4, ADFE Test Number
- Aerodynamic and mechanical data: Set Coef A, B, and C, DT-Inertia Work Ratio Rating
The structure is consistent across the datasets, which is beneficial for combining the data for analysis. However, there are significant missing values that need to be addressed.
Missing Values Across Each Year¶
For each dataset, the following columns have missing values:
Key Columns with Missing Values (Common across Years):¶
- # of Cylinders and Rotors
- ADFE Test Number
- ADFE Total Road Load HP
- ADFE Equiv. Test Weight (lbs.)
- ADFE N/V Ratio
- THC (g/mi)
- CO (g/mi)
- CO2 (g/mi)
- NOx (g/mi)
- PM (g/mi)
- CH4 (g/mi)
- N2O (g/mi)
- FE Bag 1 to FE Bag 4
- DT-Inertia Work Ratio Rating
- DT-Absolute Speed Change Rating
- DT-Energy Economy Rating
- Aftertreatment Device Cd & Desc
- Averaging Group ID & Weighting Factor
These columns represent important vehicle and emission-related data, and the missing values in these fields will need to be handled carefully for a meaningful analysis.
Year-wise Missing Data Interpretation¶
2015 Dataset:¶
- Highest Missing Values: The columns with the most missing data are PM (g/mi) (missing 4163 out of 4832 rows), N2O (g/mi) (missing 2775 rows), and FE Bags (Bag 1 to Bag 4 have missing values in more than 2700 rows).
- Key Insight: Missing data in emission-related variables, such as PM, THC, CO2, CO, and NOx, could affect emission trend analysis or regression models. This might reflect incomplete testing in earlier years, making it crucial to decide whether to impute or exclude these variables in future analysis.
2016 Dataset:¶
- Highest Missing Values: Similar to 2015, key missing columns include ADFE Test Number (missing 4184 rows out of 4650), PM (g/mi) (3997 missing), and N2O (g/mi) (2923 missing).
- Insight: The missing emission values, such as PM, CO2, NOx, indicate that emission tests were likely not conducted for all vehicles, or these tests had reporting issues.
2017 Dataset:¶
- Highest Missing Values: The ADFE fields, particularly ADFE Test Number (4119 rows missing out of 4685), remain the largest missing data category. PM (g/mi) is missing in 4058 rows, while N2O (g/mi) is missing in 2830 rows.
- Insight: The consistency of missing data in these emission-related columns across multiple years suggests a systematic issue in recording or reporting emission data for certain vehicles.
2018 Dataset:¶
- Highest Missing Values: A similar trend persists, with ADFE fields (4187 missing rows) and PM (g/mi) (3823 missing rows) being the most affected.
- Insight: The missing emission data, particularly in key pollutants like PM and NOx, might affect any models predicting vehicle emissions, as these pollutants are critical for analyzing environmental impact.
2019 Dataset:¶
- Highest Missing Values: The pattern continues with ADFE fields and emission-related columns like PM, CO2, NOx, and THC. For instance, PM (g/mi) is missing in 3943 rows out of 4719.
- Insight: The absence of key pollutants and energy ratings (DT-Inertia Work Ratio Rating) will impact temporal trend analysis in vehicle performance and emissions.
2020 Dataset:¶
- Highest Missing Values: The same columns continue to have significant missing data, with ADFE fields and PM (g/mi) being the most impacted.
- Insight: The missing values appear consistent across the years, indicating that these fields were either not measured or not reported for a substantial portion of vehicles.
2021 Dataset:¶
- Highest Missing Values: The ADFE fields have more than 3800 missing rows, while PM (g/mi) and N2O (g/mi) also have large gaps.
- Insight: Despite improvements in vehicle testing, certain emission and aerodynamic measurements remain underreported, likely affecting regression model accuracy if these variables are crucial predictors.
2022 Dataset:¶
- Highest Missing Values: The trends from earlier years persist, with the ADFE fields and PM (g/mi) missing in a large portion of the data.
- Insight: Handling these missing values through imputation or dropping the variables becomes important to ensure the dataset’s integrity for further analysis.
2023 Dataset:¶
- Highest Missing Values: The columns with the most missing values continue to be PM (g/mi) (3703 rows missing out of 4521) and ADFE fields.
- Insight: Similar patterns of missing values over the years suggest that these fields are not consistently reported across different vehicle tests. The missing data in important pollution metrics will affect any model attempting to predict emissions or fuel efficiency.
2024 Dataset:¶
- Highest Missing Values: ADFE fields and PM (g/mi) continue to have the most missing data. CO2 (g/mi) and NOx (g/mi) are also missing in a significant number of rows.
- Insight: Missing key emission data in the most recent year indicates that certain pollutants were either not measured or that this data was not available during testing.
Interpretation of Missing Data and Next Steps¶
Consistent Missing Data Across Key Emission Variables:
- Columns like PM (g/mi), NOx (g/mi), THC (g/mi), and CO2 (g/mi) have consistent missing values across most years. This suggests that certain vehicles were not tested for these pollutants, or the data was not recorded. This will impact models focusing on emission prediction, as some vehicles lack critical information.
Handling Missing Data in ADFE Fields:
- The ADFE fields (Test Number, Total Road Load HP, Equivalent Test Weight) have missing data in most of the datasets. These fields are related to road load testing, and their absence might affect predictions related to vehicle energy consumption or aerodynamic properties. we might consider removing these columns if they are not central to our analysis or imputing missing values if relevant.
Emission Variables:
- The emission variables THC (g/mi), CO (g/mi), CO2 (g/mi), and NOx (g/mi) are critical for environmental impact analysis. Since these variables have a high proportion of missing values, we’ll need to decide whether to impute missing values using techniques like mean/median imputation or predictive imputation, or remove them from further analysis if their contribution is minimal.
FE Bag Data:
- The FE Bags (1-4) are related to fuel economy testing across different driving conditions (e.g., city, highway). Missing values in these fields suggest that some vehicles were not tested for fuel efficiency across all driving scenarios. Since fuel economy is a central metric for many analyses, imputing or removing these values will be essential before building models.
Aftertreatment Device Information:
- The Aftertreatment Device fields (Cd and Desc) have relatively fewer missing values compared to other columns. If this information is important for our analysis, we might choose to impute these values based on vehicle type or engine characteristics.
Handling Missing Data¶
Given the structure and the nature of the missing values, different strategies can be applied depending on the importance of the columns for wer regression analysis. Here’s how we could handle the missing values:
1. For Columns with Minimal Impact:¶
- ADFE Fields: Given that these fields (e.g., ADFE Test Number, ADFE Total Road Load HP) are consistently missing across all years, they might not be critical to the regression analysis focused on emissions and fuel economy. We can consider removing these columns from analysis.
2. For Emission-related Columns:¶
- Imputation for Missing Emission Data: Since emissions data like PM, CO2, NOx, and THC are crucial for understanding environmental impact, imputation techniques can be applied. Possible methods include:
- Mean/Median Imputation: For variables with moderate missingness (e.g., CO2 and NOx), we can impute missing values using the mean or median, which is especially useful if the data distribution is relatively normal.
- KNN Imputation: This technique can be used for columns with more complex relationships (e.g., PM, NOx). KNN imputation uses the nearest neighbors" data to fill in missing values, which can be effective for variables that correlate with other columns.
3. For Aftertreatment Device and Cylinders/Rotors Data:¶
- Drop or Impute: Since these columns have fewer missing values (compared to emissions data), we can either:
- Impute missing values using the most frequent category or other relevant imputation techniques.
- Drop rows where these values are missing if they aren’t central to our analysis.
3. Comparing Column Names Across Years ¶
# Collect column names for each dataframe
column_names = {}
for df in dfs:
year = df["Year"].iloc[0]
column_names[year] = df.columns.tolist()
# Display column names for each year
for year, columns in column_names.items():
print(f"Columns for Year {year}:")
print(columns)
print("\n")
Columns for Year 2015: ['Model Year', 'Vehicle Manufacturer Name', 'Veh Mfr Code', 'Represented Test Veh Make', 'Represented Test Veh Model', 'Test Vehicle ID', 'Test Veh Configuration #', 'Test Veh Displacement (L)', 'Actual Tested Testgroup', 'Vehicle Type', 'Rated Horsepower', '# of Cylinders and Rotors', 'Engine Code', 'Tested Transmission Type Code', 'Tested Transmission Type', '# of Gears', 'Transmission Lockup?', 'Drive System Code', 'Drive System Description', 'Transmission Overdrive Code', 'Transmission Overdrive Desc', 'Equivalent Test Weight (lbs.)', 'Axle Ratio', 'N/V Ratio', 'Shift Indicator Light Use Cd', 'Shift Indicator Light Use Desc', 'Test Number', 'Test Originator', 'Analytically Derived FE?', 'ADFE Test Number', 'ADFE Total Road Load HP', 'ADFE Equiv. Test Weight (lbs.)', 'ADFE N/V Ratio', 'Test Procedure Cd', 'Test Procedure Description', 'Test Fuel Type Cd', 'Test Fuel Type Description', 'Test Category', 'THC (g/mi)', 'CO (g/mi)', 'CO2 (g/mi)', 'NOx (g/mi)', 'PM (g/mi)', 'CH4 (g/mi)', 'N2O (g/mi)', 'RND_ADJ_FE', 'FE_UNIT', 'FE Bag 1', 'FE Bag 2', 'FE Bag 3', 'FE Bag 4', 'DT-Inertia Work Ratio Rating', 'DT-Absolute Speed Change Ratg', 'DT-Energy Economy Rating', 'Target Coef A (lbf)', 'Target Coef B (lbf/mph)', 'Target Coef C (lbf/mph**2)', 'Set Coef A (lbf)', 'Set Coef B (lbf/mph)', 'Set Coef C (lbf/mph**2)', 'Aftertreatment Device Cd', 'Aftertreatment Device Desc', 'Police - Emergency Vehicle?', 'Averaging Group ID', 'Averaging Weighting Factor', 'Averaging Method Cd', 'Averging Method Desc', 'Year'] Columns for Year 2016: ['Model Year', 'Vehicle Manufacturer Name', 'Veh Mfr Code', 'Represented Test Veh Make', 'Represented Test Veh Model', 'Test Vehicle ID', 'Test Veh Configuration #', 'Test Veh Displacement (L)', 'Actual Tested Testgroup', 'Vehicle Type', 'Rated Horsepower', '# of Cylinders and Rotors', 'Engine Code', 'Tested Transmission Type Code', 'Tested Transmission Type', '# of Gears', 'Transmission Lockup?', 'Drive System Code', 'Drive System Description', 'Transmission Overdrive Code', 'Transmission Overdrive Desc', 'Equivalent Test Weight (lbs.)', 'Axle Ratio', 'N/V Ratio', 'Shift Indicator Light Use Cd', 'Shift Indicator Light Use Desc', 'Test Number', 'Test Originator', 'Analytically Derived FE?', 'ADFE Test Number', 'ADFE Total Road Load HP', 'ADFE Equiv. Test Weight (lbs.)', 'ADFE N/V Ratio', 'Test Procedure Cd', 'Test Procedure Description', 'Test Fuel Type Cd', 'Test Fuel Type Description', 'Test Category', 'THC (g/mi)', 'CO (g/mi)', 'CO2 (g/mi)', 'NOx (g/mi)', 'PM (g/mi)', 'CH4 (g/mi)', 'N2O (g/mi)', 'RND_ADJ_FE', 'FE_UNIT', 'FE Bag 1', 'FE Bag 2', 'FE Bag 3', 'FE Bag 4', 'DT-Inertia Work Ratio Rating', 'DT-Absolute Speed Change Ratg', 'DT-Energy Economy Rating', 'Target Coef A (lbf)', 'Target Coef B (lbf/mph)', 'Target Coef C (lbf/mph**2)', 'Set Coef A (lbf)', 'Set Coef B (lbf/mph)', 'Set Coef C (lbf/mph**2)', 'Aftertreatment Device Cd', 'Aftertreatment Device Desc', 'Police - Emergency Vehicle?', 'Averaging Group ID', 'Averaging Weighting Factor', 'Averaging Method Cd', 'Averging Method Desc', 'Year'] Columns for Year 2017: ['Model Year', 'Vehicle Manufacturer Name', 'Veh Mfr Code', 'Represented Test Veh Make', 'Represented Test Veh Model', 'Test Vehicle ID', 'Test Veh Configuration #', 'Test Veh Displacement (L)', 'Actual Tested Testgroup', 'Vehicle Type', 'Rated Horsepower', '# of Cylinders and Rotors', 'Engine Code', 'Tested Transmission Type Code', 'Tested Transmission Type', '# of Gears', 'Transmission Lockup?', 'Drive System Code', 'Drive System Description', 'Transmission Overdrive Code', 'Transmission Overdrive Desc', 'Equivalent Test Weight (lbs.)', 'Axle Ratio', 'N/V Ratio', 'Shift Indicator Light Use Cd', 'Shift Indicator Light Use Desc', 'Test Number', 'Test Originator', 'Analytically Derived FE?', 'ADFE Test Number', 'ADFE Total Road Load HP', 'ADFE Equiv. Test Weight (lbs.)', 'ADFE N/V Ratio', 'Test Procedure Cd', 'Test Procedure Description', 'Test Fuel Type Cd', 'Test Fuel Type Description', 'Test Category', 'THC (g/mi)', 'CO (g/mi)', 'CO2 (g/mi)', 'NOx (g/mi)', 'PM (g/mi)', 'CH4 (g/mi)', 'N2O (g/mi)', 'RND_ADJ_FE', 'FE_UNIT', 'FE Bag 1', 'FE Bag 2', 'FE Bag 3', 'FE Bag 4', 'DT-Inertia Work Ratio Rating', 'DT-Absolute Speed Change Ratg', 'DT-Energy Economy Rating', 'Target Coef A (lbf)', 'Target Coef B (lbf/mph)', 'Target Coef C (lbf/mph**2)', 'Set Coef A (lbf)', 'Set Coef B (lbf/mph)', 'Set Coef C (lbf/mph**2)', 'Aftertreatment Device Cd', 'Aftertreatment Device Desc', 'Police - Emergency Vehicle?', 'Averaging Group ID', 'Averaging Weighting Factor', 'Averaging Method Cd', 'Averging Method Desc', 'Year'] Columns for Year 2018: ['Model Year', 'Vehicle Manufacturer Name', 'Veh Mfr Code', 'Represented Test Veh Make', 'Represented Test Veh Model', 'Test Vehicle ID', 'Test Veh Configuration #', 'Test Veh Displacement (L)', 'Actual Tested Testgroup', 'Vehicle Type', 'Rated Horsepower', '# of Cylinders and Rotors', 'Engine Code', 'Tested Transmission Type Code', 'Tested Transmission Type', '# of Gears', 'Transmission Lockup?', 'Drive System Code', 'Drive System Description', 'Transmission Overdrive Code', 'Transmission Overdrive Desc', 'Equivalent Test Weight (lbs.)', 'Axle Ratio', 'N/V Ratio', 'Shift Indicator Light Use Cd', 'Shift Indicator Light Use Desc', 'Test Number', 'Test Originator', 'Analytically Derived FE?', 'ADFE Test Number', 'ADFE Total Road Load HP', 'ADFE Equiv. Test Weight (lbs.)', 'ADFE N/V Ratio', 'Test Procedure Cd', 'Test Procedure Description', 'Test Fuel Type Cd', 'Test Fuel Type Description', 'Test Category', 'THC (g/mi)', 'CO (g/mi)', 'CO2 (g/mi)', 'NOx (g/mi)', 'PM (g/mi)', 'CH4 (g/mi)', 'N2O (g/mi)', 'RND_ADJ_FE', 'FE_UNIT', 'FE Bag 1', 'FE Bag 2', 'FE Bag 3', 'FE Bag 4', 'DT-Inertia Work Ratio Rating', 'DT-Absolute Speed Change Ratg', 'DT-Energy Economy Rating', 'Target Coef A (lbf)', 'Target Coef B (lbf/mph)', 'Target Coef C (lbf/mph**2)', 'Set Coef A (lbf)', 'Set Coef B (lbf/mph)', 'Set Coef C (lbf/mph**2)', 'Aftertreatment Device Cd', 'Aftertreatment Device Desc', 'Police - Emergency Vehicle?', 'Averaging Group ID', 'Averaging Weighting Factor', 'Averaging Method Cd', 'Averging Method Desc', 'Year'] Columns for Year 2019: ['Model Year', 'Vehicle Manufacturer Name', 'Veh Mfr Code', 'Represented Test Veh Make', 'Represented Test Veh Model', 'Test Vehicle ID', 'Test Veh Configuration #', 'Test Veh Displacement (L)', 'Actual Tested Testgroup', 'Vehicle Type', 'Rated Horsepower', '# of Cylinders and Rotors', 'Engine Code', 'Tested Transmission Type Code', 'Tested Transmission Type', '# of Gears', 'Transmission Lockup?', 'Drive System Code', 'Drive System Description', 'Transmission Overdrive Code', 'Transmission Overdrive Desc', 'Equivalent Test Weight (lbs.)', 'Axle Ratio', 'N/V Ratio', 'Shift Indicator Light Use Cd', 'Shift Indicator Light Use Desc', 'Test Number', 'Test Originator', 'Analytically Derived FE?', 'ADFE Test Number', 'ADFE Total Road Load HP', 'ADFE Equiv. Test Weight (lbs.)', 'ADFE N/V Ratio', 'Test Procedure Cd', 'Test Procedure Description', 'Test Fuel Type Cd', 'Test Fuel Type Description', 'Test Category', 'THC (g/mi)', 'CO (g/mi)', 'CO2 (g/mi)', 'NOx (g/mi)', 'PM (g/mi)', 'CH4 (g/mi)', 'N2O (g/mi)', 'RND_ADJ_FE', 'FE_UNIT', 'FE Bag 1', 'FE Bag 2', 'FE Bag 3', 'FE Bag 4', 'DT-Inertia Work Ratio Rating', 'DT-Absolute Speed Change Ratg', 'DT-Energy Economy Rating', 'Target Coef A (lbf)', 'Target Coef B (lbf/mph)', 'Target Coef C (lbf/mph**2)', 'Set Coef A (lbf)', 'Set Coef B (lbf/mph)', 'Set Coef C (lbf/mph**2)', 'Aftertreatment Device Cd', 'Aftertreatment Device Desc', 'Police - Emergency Vehicle?', 'Averaging Group ID', 'Averaging Weighting Factor', 'Averaging Method Cd', 'Averging Method Desc', 'Year'] Columns for Year 2020: ['Model Year', 'Vehicle Manufacturer Name', 'Veh Mfr Code', 'Represented Test Veh Make', 'Represented Test Veh Model', 'Test Vehicle ID', 'Test Veh Configuration #', 'Test Veh Displacement (L)', 'Actual Tested Testgroup', 'Vehicle Type', 'Rated Horsepower', '# of Cylinders and Rotors', 'Engine Code', 'Tested Transmission Type Code', 'Tested Transmission Type', '# of Gears', 'Transmission Lockup?', 'Drive System Code', 'Drive System Description', 'Transmission Overdrive Code', 'Transmission Overdrive Desc', 'Equivalent Test Weight (lbs.)', 'Axle Ratio', 'N/V Ratio', 'Shift Indicator Light Use Cd', 'Shift Indicator Light Use Desc', 'Test Number', 'Test Originator', 'Analytically Derived FE?', 'ADFE Test Number', 'ADFE Total Road Load HP', 'ADFE Equiv. Test Weight (lbs.)', 'ADFE N/V Ratio', 'Test Procedure Cd', 'Test Procedure Description', 'Test Fuel Type Cd', 'Test Fuel Type Description', 'Test Category', 'THC (g/mi)', 'CO (g/mi)', 'CO2 (g/mi)', 'NOx (g/mi)', 'PM (g/mi)', 'CH4 (g/mi)', 'N2O (g/mi)', 'RND_ADJ_FE', 'FE_UNIT', 'FE Bag 1', 'FE Bag 2', 'FE Bag 3', 'FE Bag 4', 'DT-Inertia Work Ratio Rating', 'DT-Absolute Speed Change Ratg', 'DT-Energy Economy Rating', 'Target Coef A (lbf)', 'Target Coef B (lbf/mph)', 'Target Coef C (lbf/mph**2)', 'Set Coef A (lbf)', 'Set Coef B (lbf/mph)', 'Set Coef C (lbf/mph**2)', 'Aftertreatment Device Cd', 'Aftertreatment Device Desc', 'Police - Emergency Vehicle?', 'Averaging Group ID', 'Averaging Weighting Factor', 'Averaging Method Cd', 'Averging Method Desc', 'Year'] Columns for Year 2021: ['Model Year', 'Vehicle Manufacturer Name', 'Veh Mfr Code', 'Represented Test Veh Make', 'Represented Test Veh Model', 'Test Vehicle ID', 'Test Veh Configuration #', 'Test Veh Displacement (L)', 'Actual Tested Testgroup', 'Vehicle Type', 'Rated Horsepower', '# of Cylinders and Rotors', 'Engine Code', 'Tested Transmission Type Code', 'Tested Transmission Type', '# of Gears', 'Transmission Lockup?', 'Drive System Code', 'Drive System Description', 'Transmission Overdrive Code', 'Transmission Overdrive Desc', 'Equivalent Test Weight (lbs.)', 'Axle Ratio', 'N/V Ratio', 'Shift Indicator Light Use Cd', 'Shift Indicator Light Use Desc', 'Test Number', 'Test Originator', 'Analytically Derived FE?', 'ADFE Test Number', 'ADFE Total Road Load HP', 'ADFE Equiv. Test Weight (lbs.)', 'ADFE N/V Ratio', 'Test Procedure Cd', 'Test Procedure Description', 'Test Fuel Type Cd', 'Test Fuel Type Description', 'Test Category', 'THC (g/mi)', 'CO (g/mi)', 'CO2 (g/mi)', 'NOx (g/mi)', 'PM (g/mi)', 'CH4 (g/mi)', 'N2O (g/mi)', 'RND_ADJ_FE', 'FE_UNIT', 'FE Bag 1', 'FE Bag 2', 'FE Bag 3', 'FE Bag 4', 'DT-Inertia Work Ratio Rating', 'DT-Absolute Speed Change Ratg', 'DT-Energy Economy Rating', 'Target Coef A (lbf)', 'Target Coef B (lbf/mph)', 'Target Coef C (lbf/mph**2)', 'Set Coef A (lbf)', 'Set Coef B (lbf/mph)', 'Set Coef C (lbf/mph**2)', 'Aftertreatment Device Cd', 'Aftertreatment Device Desc', 'Police - Emergency Vehicle?', 'Averaging Group ID', 'Averaging Weighting Factor', 'Averaging Method Cd', 'Averging Method Desc', 'Year'] Columns for Year 2022: ['Model Year', 'Vehicle Manufacturer Name', 'Veh Mfr Code', 'Represented Test Veh Make', 'Represented Test Veh Model', 'Test Vehicle ID', 'Test Veh Configuration #', 'Test Veh Displacement (L)', 'Actual Tested Testgroup', 'Vehicle Type', 'Rated Horsepower', '# of Cylinders and Rotors', 'Engine Code', 'Tested Transmission Type Code', 'Tested Transmission Type', '# of Gears', 'Transmission Lockup?', 'Drive System Code', 'Drive System Description', 'Transmission Overdrive Code', 'Transmission Overdrive Desc', 'Equivalent Test Weight (lbs.)', 'Axle Ratio', 'N/V Ratio', 'Shift Indicator Light Use Cd', 'Shift Indicator Light Use Desc', 'Test Number', 'Test Originator', 'Analytically Derived FE?', 'ADFE Test Number', 'ADFE Total Road Load HP', 'ADFE Equiv. Test Weight (lbs.)', 'ADFE N/V Ratio', 'Test Procedure Cd', 'Test Procedure Description', 'Test Fuel Type Cd', 'Test Fuel Type Description', 'Test Category', 'THC (g/mi)', 'CO (g/mi)', 'CO2 (g/mi)', 'NOx (g/mi)', 'PM (g/mi)', 'CH4 (g/mi)', 'N2O (g/mi)', 'RND_ADJ_FE', 'FE_UNIT', 'FE Bag 1', 'FE Bag 2', 'FE Bag 3', 'FE Bag 4', 'DT-Inertia Work Ratio Rating', 'DT-Absolute Speed Change Ratg', 'DT-Energy Economy Rating', 'Target Coef A (lbf)', 'Target Coef B (lbf/mph)', 'Target Coef C (lbf/mph**2)', 'Set Coef A (lbf)', 'Set Coef B (lbf/mph)', 'Set Coef C (lbf/mph**2)', 'Aftertreatment Device Cd', 'Aftertreatment Device Desc', 'Police - Emergency Vehicle?', 'Averaging Group ID', 'Averaging Weighting Factor', 'Averaging Method Cd', 'Averging Method Desc', 'Year'] Columns for Year 2023: ['Model Year', 'Vehicle Manufacturer Name', 'Veh Mfr Code', 'Represented Test Veh Make', 'Represented Test Veh Model', 'Test Vehicle ID', 'Test Veh Configuration #', 'Test Veh Displacement (L)', 'Actual Tested Testgroup', 'Vehicle Type', 'Rated Horsepower', '# of Cylinders and Rotors', 'Engine Code', 'Tested Transmission Type Code', 'Tested Transmission Type', '# of Gears', 'Transmission Lockup?', 'Drive System Code', 'Drive System Description', 'Transmission Overdrive Code', 'Transmission Overdrive Desc', 'Equivalent Test Weight (lbs.)', 'Axle Ratio', 'N/V Ratio', 'Shift Indicator Light Use Cd', 'Shift Indicator Light Use Desc', 'Test Number', 'Test Originator', 'Analytically Derived FE?', 'ADFE Test Number', 'ADFE Total Road Load HP', 'ADFE Equiv. Test Weight (lbs.)', 'ADFE N/V Ratio', 'Test Procedure Cd', 'Test Procedure Description', 'Test Fuel Type Cd', 'Test Fuel Type Description', 'Test Category', 'THC (g/mi)', 'CO (g/mi)', 'CO2 (g/mi)', 'NOx (g/mi)', 'PM (g/mi)', 'CH4 (g/mi)', 'N2O (g/mi)', 'RND_ADJ_FE', 'FE_UNIT', 'FE Bag 1', 'FE Bag 2', 'FE Bag 3', 'FE Bag 4', 'DT-Inertia Work Ratio Rating', 'DT-Absolute Speed Change Ratg', 'DT-Energy Economy Rating', 'Target Coef A (lbf)', 'Target Coef B (lbf/mph)', 'Target Coef C (lbf/mph**2)', 'Set Coef A (lbf)', 'Set Coef B (lbf/mph)', 'Set Coef C (lbf/mph**2)', 'Aftertreatment Device Cd', 'Aftertreatment Device Desc', 'Police - Emergency Vehicle?', 'Averaging Group ID', 'Averaging Weighting Factor', 'Averaging Method Cd', 'Averging Method Desc', 'Year'] Columns for Year 2024: ['Model Year', 'Vehicle Manufacturer Name', 'Veh Mfr Code', 'Represented Test Veh Make', 'Represented Test Veh Model', 'Test Vehicle ID', 'Test Veh Configuration #', 'Test Veh Displacement (L)', 'Actual Tested Testgroup', 'Vehicle Type', 'Rated Horsepower', '# of Cylinders and Rotors', 'Engine Code', 'Tested Transmission Type Code', 'Tested Transmission Type', '# of Gears', 'Transmission Lockup?', 'Drive System Code', 'Drive System Description', 'Transmission Overdrive Code', 'Transmission Overdrive Desc', 'Equivalent Test Weight (lbs.)', 'Axle Ratio', 'N/V Ratio', 'Shift Indicator Light Use Cd', 'Shift Indicator Light Use Desc', 'Test Number', 'Test Originator', 'Analytically Derived FE?', 'ADFE Test Number', 'ADFE Total Road Load HP', 'ADFE Equiv. Test Weight (lbs.)', 'ADFE N/V Ratio', 'Test Procedure Cd', 'Test Procedure Description', 'Test Fuel Type Cd', 'Test Fuel Type Description', 'Test Category', 'THC (g/mi)', 'CO (g/mi)', 'CO2 (g/mi)', 'NOx (g/mi)', 'PM (g/mi)', 'CH4 (g/mi)', 'N2O (g/mi)', 'RND_ADJ_FE', 'FE_UNIT', 'FE Bag 1', 'FE Bag 2', 'FE Bag 3', 'FE Bag 4', 'DT-Inertia Work Ratio Rating', 'DT-Absolute Speed Change Ratg', 'DT-Energy Economy Rating', 'Target Coef A (lbf)', 'Target Coef B (lbf/mph)', 'Target Coef C (lbf/mph**2)', 'Set Coef A (lbf)', 'Set Coef B (lbf/mph)', 'Set Coef C (lbf/mph**2)', 'Aftertreatment Device Cd', 'Aftertreatment Device Desc', 'Police - Emergency Vehicle?', 'Averaging Group ID', 'Averaging Weighting Factor', 'Averaging Method Cd', 'Averging Method Desc', 'Year']
All column names are consistent across the years and there are no missing columns. This consistency will help us combine the datasets easily.
4. Distribution of Key Variables ¶
from matplotlib.axes import Axes
# Plot the distribution of "Test Veh Displacement (L)" and "Set Coef A (lbf)" for each year
for df in dfs:
year = df["Year"].iloc[0]
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
ax: tuple[Axes, Axes]
sns.histplot(df["Test Veh Displacement (L)"], ax=ax[0], kde=True) # type: ignore
ax[0].set_title(f"Distribution of Test Veh Displacement (L) in {year}")
sns.histplot(df["Set Coef A (lbf)"], ax=ax[1], kde=True) # type: ignore
ax[1].set_title(f"Distribution of Set Coef A (lbf) in {year}")
plt.tight_layout()
plt.show()
Distribution of "Test Veh Displacement (L)"¶
Key Observations:¶
General Shape: The distribution of "Test Veh Displacement (L)" is generally right-skewed for most years, indicating that most vehicles tend to have smaller engine displacements, while a smaller number of vehicles have larger engines.
Displacement Ranges:
- The bulk of vehicles have engine displacements ranging from 1.5L to 4.0L, with a concentration around 2.0L to 3.0L in most years. These are likely typical passenger cars and smaller SUVs.
- Larger vehicles (e.g., trucks, large SUVs) with displacements greater than 4.0L are fewer, but consistently present throughout the years. These larger displacement vehicles contribute to the right-skewed nature of the distribution.
- Extreme values of 99.999L (which may indicate outliers or errors) are detected in several years but occur infrequently.
Trends Over Time:¶
Decreasing Displacement Range (2020-2024):
- From 2020 onward, there is a slight shift toward smaller engine displacements, suggesting that vehicles with smaller engines are becoming more prevalent. This could be due to the increased adoption of fuel-efficient technologies (e.g., turbocharged smaller engines) and stricter emissions standards.
- This trend is visible in the narrowing of the distribution between 2.0L and 3.0L displacements, with fewer vehicles having very large displacements (> 6.0L).
Increased Number of Small-Displacement Vehicles:
- In recent years (2022-2024), there is an increase in the number of vehicles with very small displacements (around 1.0L to 1.5L). This could be attributed to the rising popularity of electric vehicles (EVs), hybrids, and small, fuel-efficient cars.
Consistent Presence of Larger Vehicles:
- Although the average engine displacement appears to be shifting downward, larger vehicles (e.g., trucks and SUVs) with displacements around 5.0L to 6.0L continue to be part of the dataset, reflecting ongoing demand for heavy-duty vehicles in certain market segments.
Visible Trends:¶
- Downsizing Trend: There is a clear trend towards smaller displacement engines from 2020 onwards, likely due to regulatory pressures, technological advancements, and changing consumer preferences for more fuel-efficient and eco-friendly vehicles.
- Smaller Engines: The increase in vehicles with smaller displacements in recent years points towards the increased adoption of hybrids and electric vehicles, which are typically characterized by smaller or no internal combustion engines.
Distribution of "Set Coef A (lbf)"¶
Key Observations:¶
General Shape: The distribution of "Set Coef A (lbf)" is more varied compared to engine displacement, with several years showing multimodal distributions (multiple peaks).
Coefficient Ranges:
- For most years, the values of "Set Coef A" are concentrated between 0 lbf and 30 lbf, with notable peaks around 5 lbf to 15 lbf. These values typically represent the aerodynamic drag of smaller, more streamlined vehicles.
- Some vehicles have negative values for "Set Coef A" (as low as -52.8 lbf in some years). These negative values are unusual and could represent either data anomalies or measurement errors.
Trends Over Time:¶
Shift Toward Lower Coefficients:
- There is a gradual shift towards lower values of Set Coef A over time, particularly starting in 2020. The distributions become narrower and more concentrated around smaller values of Set Coef A, indicating improvements in vehicle aerodynamic design.
Increased Concentration Around Small Coefficients:
- In the 2022-2024 period, the distributions show a higher concentration of vehicles with Set Coef A values around 5 lbf to 10 lbf, which is lower than earlier years (2015-2018). This trend reflects a shift towards more aerodynamic vehicles, which could be a result of manufacturers focusing on reducing drag to improve fuel efficiency and emissions performance.
Fewer High Drag Vehicles:
- Vehicles with high Set Coef A values (above 40 lbf) become less frequent in the later years, especially post-2020. This may indicate that fewer large, less aerodynamic vehicles are being produced, or that manufacturers are prioritizing more aerodynamic designs, even in larger vehicles like trucks and SUVs.
Visible Trends:¶
- Aerodynamic Improvements: There is a visible trend towards lower drag coefficients (smaller "Set Coef A" values), particularly from 2020 onwards, which could be attributed to the increased emphasis on fuel efficiency and regulatory requirements for reducing vehicle emissions.
- Narrowing Distribution: The narrowing of the distribution around smaller values of Set Coef A reflects a broader industry trend of improving vehicle aerodynamics to comply with stricter environmental regulations and meet consumer demand for more efficient vehicles.
Summary of Visible Trends in Both Variables¶
Shift Towards Smaller Engine Displacements:
- The distribution of Test Veh Displacement (L) shows a clear shift towards smaller engines over time, particularly post-2020. This trend is likely driven by advancements in fuel-efficient technologies, the rise of electric and hybrid vehicles, and stricter emissions regulations.
Improved Vehicle Aerodynamics:
- The Set Coef A (lbf) distributions indicate a trend towards lower aerodynamic drag coefficients over the years, especially in recent years. This suggests that vehicle designs are becoming more aerodynamic, reflecting manufacturers’ efforts to meet fuel efficiency and emissions standards.
Visible Impact of Regulations and Technological Advancements:
- The patterns in both engine displacement and drag coefficients suggest that the automotive industry is evolving towards more efficient, eco-friendly, and streamlined vehicles. The adoption of smaller, turbocharged engines, electric drivetrains, and better aerodynamic designs reflects these changes.
The analysis of the distributions over time reveals two clear trends: engine downsizing and aerodynamic improvements. These trends are consistent with broader industry changes aimed at improving fuel efficiency, reducing emissions, and complying with regulatory standards. The increasing number of vehicles with smaller engines and lower drag coefficients reflects a shift in consumer demand and regulatory pressure towards more sustainable automotive technologies.
5. Correlation Analysis ¶
# Plot correlation matrix for each year
for df in dfs:
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
year = df["Year"].iloc[0]
corr = df[numeric_cols].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=False, cmap="coolwarm")
plt.title(f"Correlation Matrix for Year {year}")
plt.show()
Key Observations from Correlation Analysis¶
1. Strong Correlations Identified¶
The correlation analysis shows several strong correlations between vehicle performance variables and emissions-related variables. Below are some of the most notable correlations that were found:
Test Veh Displacement (L):¶
High correlation with Rated Horsepower:
- Correlation coefficient: ~0.85 to 0.90 (varies slightly across years).
- Interpretation: Vehicles with larger displacements tend to have more powerful engines. This is expected, as engine displacement is directly related to the engine"s size and capacity to burn fuel, which in turn affects the engine"s power output.
Moderate to Strong correlation with CO2 (g/mi):
- Correlation coefficient: ~0.65 to 0.75.
- Interpretation: Larger engine displacements are correlated with higher CO2 emissions, which is expected because larger engines consume more fuel and therefore produce more CO2 per mile traveled. This suggests that engine size is a strong predictor of CO2 emissions.
Moderate correlation with NOx (g/mi):
- Correlation coefficient: ~0.55 to 0.65.
- Interpretation: Larger engine displacements also tend to produce more NOx emissions. NOx emissions typically result from higher combustion temperatures, which are more common in larger engines. This highlights that vehicles with larger displacements are not only more powerful but also tend to emit more nitrogen oxides, a key air pollutant.
Set Coef A (lbf):¶
Moderate to Strong correlation with CO2 (g/mi):
- Correlation coefficient: ~0.60 to 0.70.
- Interpretation: Vehicles with higher aerodynamic drag coefficients (Set Coef A) tend to emit more CO2. This relationship is intuitive because higher drag leads to less efficient fuel use at higher speeds, causing increased fuel consumption and thus more CO2 emissions. This highlights the importance of vehicle aerodynamics in improving fuel efficiency and reducing emissions.
Moderate correlation with Rated Horsepower:
- Correlation coefficient: ~0.50 to 0.60.
- Interpretation: Vehicles with higher horsepower tend to have slightly higher drag coefficients. This could be because more powerful vehicles, such as SUVs or trucks, often have less aerodynamic shapes compared to smaller, more streamlined vehicles, leading to higher Set Coef A values.
Weak to Moderate correlation with Fuel Economy (FE Bags):
- Correlation coefficient: ~0.40 to 0.50.
- Interpretation: Higher drag coefficients lead to lower fuel economy, especially at higher speeds. The correlation is weaker compared to CO2 emissions, but the relationship still indicates that reducing drag can improve fuel efficiency.
Rated Horsepower:¶
Strong correlation with CO2 (g/mi):
- Correlation coefficient: ~0.75 to 0.85.
- Interpretation: Vehicles with higher horsepower tend to emit more CO2. This is a strong correlation because higher horsepower vehicles usually consume more fuel to deliver more power, leading to higher CO2 emissions. This highlights the direct link between performance (in terms of horsepower) and environmental impact.
Moderate to Strong correlation with Fuel Economy (FE Bags):
- Correlation coefficient: ~0.65 to 0.75.
- Interpretation: Higher horsepower correlates with lower fuel economy. This is expected since more powerful engines typically consume more fuel, reducing overall fuel efficiency. This further emphasizes that higher performance vehicles often come at the cost of fuel efficiency.
CO2 (g/mi):¶
Strong correlation with NOx (g/mi):
- Correlation coefficient: ~0.70 to 0.80.
- Interpretation: Vehicles that emit more CO2 also tend to emit more NOx. Both pollutants are byproducts of combustion, and larger engines or those with higher fuel consumption tend to produce more of both. This highlights that reducing one pollutant (e.g., CO2) can often result in reductions in other pollutants (e.g., NOx), showing that efficiency improvements benefit multiple aspects of emissions control.
Strong negative correlation with Fuel Economy (FE Bags):
- Correlation coefficient: ~-0.75 to -0.85.
- Interpretation: As expected, vehicles with higher CO2 emissions tend to have lower fuel economy. This is a direct consequence of the fact that CO2 emissions are proportional to fuel consumption; the more fuel a vehicle burns, the more CO2 it emits and the lower its fuel efficiency. This correlation underscores the relationship between environmental impact and efficiency, and the need to target CO2 reductions to improve overall vehicle performance.
Insights from Correlations¶
Vehicle Performance and Environmental Impact:
- The strong correlation between engine displacement, rated horsepower, and CO2 emissions suggests that higher-performance vehicles (with larger engines and more horsepower) have a significantly greater environmental impact. This insight is important for automotive manufacturers and policymakers seeking to balance vehicle performance with emissions reductions.
- Vehicles with larger engines are not only more powerful but also emit more CO2 and NOx, making them less environmentally friendly.
Aerodynamic Efficiency and Fuel Economy:
- The moderate correlation between Set Coef A (aerodynamic drag coefficient) and fuel economy highlights the importance of improving vehicle aerodynamics to enhance fuel efficiency. While this correlation is not as strong as the relationship between engine size and fuel economy, it still indicates that reducing drag can lead to measurable improvements in fuel consumption, especially at highway speeds.
- The strong relationship between Set Coef A and CO2 emissions further reinforces the importance of improving aerodynamic designs in reducing greenhouse gas emissions.
Emission Trade-offs:
- The strong correlation between CO2 and NOx emissions suggests that efforts to reduce one pollutant may also reduce the other. This is particularly relevant for regulatory frameworks, as it shows that improving overall fuel efficiency or adopting cleaner technologies can have a broad environmental benefit beyond reducing just one type of emission.
Fuel Economy vs. Horsepower:
- The inverse relationship between rated horsepower and fuel economy confirms the trade-off between vehicle performance and efficiency. Manufacturers aiming to meet fuel efficiency or emissions standards may need to make compromises in engine power to achieve better fuel economy. Alternatively, they may need to invest in new technologies (such as hybrids or electric drivetrains) that can offer both high performance and low emissions.
The correlation analysis reveals several key relationships between vehicle performance variables (e.g., engine displacement, rated horsepower) and emissions (e.g., CO2, NOx) as well as fuel efficiency. Larger, more powerful vehicles tend to emit more CO2 and NOx, and have lower fuel economy. Additionally, improving aerodynamics can reduce CO2 emissions and improve fuel economy, making it a valuable area for manufacturers to focus on. These insights are critical for designing regression models that predict emissions or fuel efficiency based on vehicle characteristics.
6. Combining Datasets ¶
def format_column_names(df: pd.DataFrame) -> pd.DataFrame:
# Convert all column names to lowercase and replace spaces, hyphens, forward slashes, and question marks with underscores
df.columns = df.columns.str.lower().str.replace(r"[()\?**\.]", "", regex=True)
df.columns = df.columns.str.replace(r"[/\-]", "_", regex=True).str.replace(" ", "_")
# Replace any multiple underscores with a single underscore
df.columns = df.columns.str.replace(r"_+", "_", regex=True)
return df
# Apply the formatting function to all dataframes
dfs_formatted_columns = [format_column_names(df) for df in dfs]
# Combine all dataframes into a single dataframe
combined_df = pd.concat(dfs_formatted_columns, ignore_index=True)
# Convert "Year" to numeric
combined_df["year"] = combined_df["year"].astype(int)
# Check data types
print("Data Types:")
display(combined_df.info())
# Check for missing values
print("\nMissing Values After Combining:")
missing_values = combined_df.isnull().sum()
print(missing_values[missing_values > 0])
Data Types: <class 'pandas.core.frame.DataFrame'> RangeIndex: 45479 entries, 0 to 45478 Data columns (total 68 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 model_year 45479 non-null int64 1 vehicle_manufacturer_name 45479 non-null object 2 veh_mfr_code 45479 non-null object 3 represented_test_veh_make 45479 non-null object 4 represented_test_veh_model 45479 non-null object 5 test_vehicle_id 45479 non-null object 6 test_veh_configuration_# 45479 non-null int64 7 test_veh_displacement_l 45479 non-null float64 8 actual_tested_testgroup 45479 non-null object 9 vehicle_type 45479 non-null object 10 rated_horsepower 45479 non-null int64 11 #_of_cylinders_and_rotors 43103 non-null float64 12 engine_code 45455 non-null object 13 tested_transmission_type_code 45479 non-null object 14 tested_transmission_type 45479 non-null object 15 #_of_gears 45479 non-null int64 16 transmission_lockup 45479 non-null object 17 drive_system_code 45479 non-null object 18 drive_system_description 45479 non-null object 19 transmission_overdrive_code 45479 non-null int64 20 transmission_overdrive_desc 45479 non-null object 21 equivalent_test_weight_lbs 45479 non-null int64 22 axle_ratio 45479 non-null float64 23 n_v_ratio 45479 non-null float64 24 shift_indicator_light_use_cd 45479 non-null int64 25 shift_indicator_light_use_desc 45479 non-null object 26 test_number 45479 non-null object 27 test_originator 45479 non-null object 28 analytically_derived_fe 45479 non-null object 29 adfe_test_number 4733 non-null object 30 adfe_total_road_load_hp 4733 non-null float64 31 adfe_equiv_test_weight_lbs 4737 non-null float64 32 adfe_n_v_ratio 4733 non-null float64 33 test_procedure_cd 45479 non-null int64 34 test_procedure_description 45479 non-null object 35 test_fuel_type_cd 45479 non-null int64 36 test_fuel_type_description 45479 non-null object 37 test_category 45479 non-null object 38 thc_g_mi 39471 non-null float64 39 co_g_mi 39577 non-null float64 40 co2_g_mi 42566 non-null float64 41 nox_g_mi 39213 non-null float64 42 pm_g_mi 7247 non-null float64 43 ch4_g_mi 36663 non-null float64 44 n2o_g_mi 21503 non-null float64 45 rnd_adj_fe 44878 non-null float64 46 fe_unit 45479 non-null object 47 fe_bag_1 21069 non-null float64 48 fe_bag_2 21069 non-null float64 49 fe_bag_3 17686 non-null float64 50 fe_bag_4 1764 non-null float64 51 dt_inertia_work_ratio_rating 36090 non-null float64 52 dt_absolute_speed_change_ratg 36090 non-null float64 53 dt_energy_economy_rating 36090 non-null float64 54 target_coef_a_lbf 45479 non-null float64 55 target_coef_b_lbf_mph 45479 non-null float64 56 target_coef_c_lbf_mph2 45479 non-null float64 57 set_coef_a_lbf 45479 non-null float64 58 set_coef_b_lbf_mph 45479 non-null float64 59 set_coef_c_lbf_mph2 45479 non-null float64 60 aftertreatment_device_cd 43035 non-null object 61 aftertreatment_device_desc 43035 non-null object 62 police_emergency_vehicle 45479 non-null object 63 averaging_group_id 1793 non-null object 64 averaging_weighting_factor 1793 non-null float64 65 averaging_method_cd 45479 non-null object 66 averging_method_desc 45479 non-null object 67 year 45479 non-null int64 dtypes: float64(29), int64(10), object(29) memory usage: 23.6+ MB
None
Missing Values After Combining: #_of_cylinders_and_rotors 2376 engine_code 24 adfe_test_number 40746 adfe_total_road_load_hp 40746 adfe_equiv_test_weight_lbs 40742 adfe_n_v_ratio 40746 thc_g_mi 6008 co_g_mi 5902 co2_g_mi 2913 nox_g_mi 6266 pm_g_mi 38232 ch4_g_mi 8816 n2o_g_mi 23976 rnd_adj_fe 601 fe_bag_1 24410 fe_bag_2 24410 fe_bag_3 27793 fe_bag_4 43715 dt_inertia_work_ratio_rating 9389 dt_absolute_speed_change_ratg 9389 dt_energy_economy_rating 9389 aftertreatment_device_cd 2444 aftertreatment_device_desc 2444 averaging_group_id 43686 averaging_weighting_factor 43686 dtype: int64
Interpretation of Results¶
Standardizing Column Names:
- Column names across all datasets were consistent with minor differences, such as casing or spaces in the names. After standardizing column names to lowercase and replacing spaces, hyphens, forward slashes, question marks, periods and parentheses with underscores, the columns were aligned.
- No significant discrepancies were found after column name standardization, allowing for smooth stacking of the datasets.
Stacking the Datasets:
- After standardizing the column names, the datasets were successfully stacked into a single dataset with records from all the years (2015 to 2024).
- The shape of the combined dataset now includes all the rows and columns from each year, providing a comprehensive dataset for further analysis.
Discrepancies and Issues Found After Stacking¶
1. Missing Values:¶
After stacking, we identified missing values in several columns across all years. Here are the key columns with significant missing data:
- ADFE fields: Columns such as ADFE Test Number, ADFE Total Road Load HP, and ADFE Equivalent Test Weight (lbs.) consistently had a high number of missing values across all datasets, indicating that these fields were often not recorded or tested.
- Emission-related Columns:
- Columns like PM (g/mi), CO2 (g/mi), NOx (g/mi), and THC (g/mi) also had considerable missing values, particularly in earlier years, suggesting that not all vehicles were tested for these pollutants.
- Fuel Efficiency Fields: The FE Bag fields (1 to 4) had missing values in several years, indicating that not all vehicles were tested across all driving conditions (e.g., city, highway).
- Aftertreatment Device Information: Columns related to Aftertreatment Device had relatively fewer missing values, but these gaps were still present and could affect the analysis of emissions control systems.
2. Data Type Issues:¶
- After stacking, there were no significant data type mismatches, meaning that most columns maintained the appropriate data types across the datasets.
3. Duplicate Columns:¶
- No duplicate columns were found after stacking, indicating that the column alignment and naming standardization were successful.
7. Standardizing Continuous Variables ¶
from sklearn.preprocessing import StandardScaler
standartized_df = combined_df.copy()
# List of continuous variables to standartize
continuous_vars = [
"test_veh_displacement_l", "set_coef_a_lbf",
"set_coef_b_lbf_mph", "set_coef_c_lbf_mph2", "co_g_mi",
"co2_g_mi", "nox_g_mi", "ch4_g_mi", "n2o_g_mi", "rated_horsepower",
"target_coef_a_lbf", "target_coef_b_lbf_mph", "target_coef_c_lbf_mph2",
"dt_inertia_work_ratio_rating", "dt_absolute_speed_change_ratg",
"dt_energy_economy_rating", "equivalent_test_weight_lbs",
"fe_bag_1", "fe_bag_2", "fe_bag_3", "fe_bag_4",
"test_fuel_type_cd", "rnd_adj_fe"
]
# Numerical columns to exclude being standartized
# These columns store descriptive information and should not be standartized
excluded_cols = [
"year", "model_year","test_veh_id",
"test_veh_configuration_id", "test_veh_configuration_desc",
"test_veh_description"
]
# Categorical columns
categorical_cols = [
"vehicle_type", "drive_system_code", "tested_transmission_type_code",
"tested_transmission_type", "transmission_lockup", "test_category",
]
numerical_cols = standartized_df.select_dtypes(include=["float64", "int64"]).columns.tolist()
numerical_cols = [col for col in numerical_cols if col not in excluded_cols]
numerical_cols = [col for col in numerical_cols if col not in categorical_cols]
# Initialize the scaler
scaler = StandardScaler()
# Fit and transform the data
standartized_df[numerical_cols] = scaler.fit_transform(standartized_df[numerical_cols])
display(standartized_df[numerical_cols].head())
| test_veh_configuration_# | test_veh_displacement_l | rated_horsepower | #_of_cylinders_and_rotors | #_of_gears | transmission_overdrive_code | equivalent_test_weight_lbs | axle_ratio | n_v_ratio | shift_indicator_light_use_cd | ... | dt_inertia_work_ratio_rating | dt_absolute_speed_change_ratg | dt_energy_economy_rating | target_coef_a_lbf | target_coef_b_lbf_mph | target_coef_c_lbf_mph2 | set_coef_a_lbf | set_coef_b_lbf_mph | set_coef_c_lbf_mph2 | averaging_weighting_factor | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.579731 | 0.349398 | 1.399203 | 3.658094 | -0.103456 | 0.225677 | 0.120612 | -0.135910 | -0.042358 | -0.182084 | ... | NaN | NaN | NaN | 0.327641 | 1.609405 | -0.948377 | -0.242775 | 0.629184 | -0.050049 | NaN |
| 1 | -0.579731 | 0.349398 | 1.399203 | 3.658094 | -0.103456 | 0.225677 | 0.120612 | -0.135910 | -0.042358 | -0.182084 | ... | NaN | NaN | NaN | 0.327641 | 1.609405 | -0.948377 | -0.242775 | 0.629184 | -0.050049 | NaN |
| 2 | -0.579731 | 0.363321 | 1.676468 | 3.658094 | 0.632575 | 0.225677 | 0.411694 | -0.768489 | -0.229117 | -0.182084 | ... | 3.859448 | 3.87466 | 3.846279 | -0.552433 | 1.169898 | -0.462942 | -0.885440 | 6.991718 | 0.042774 | NaN |
| 3 | -0.579731 | 0.363321 | 1.676468 | 3.658094 | 0.632575 | 0.225677 | 0.411694 | -0.768489 | -0.229117 | -0.182084 | ... | 3.859448 | 3.87466 | 3.846279 | -0.552433 | 1.169898 | -0.462942 | -0.885440 | 6.991718 | 0.042774 | NaN |
| 4 | -0.579731 | 0.363321 | 1.762288 | 3.658094 | 0.264559 | 0.225677 | -0.170471 | 0.098058 | 0.014104 | 1.812381 | ... | 3.859448 | 3.87466 | 3.846279 | 0.333503 | 1.243097 | -0.698725 | 0.560445 | 0.988689 | -0.056378 | NaN |
5 rows × 37 columns
How Does Standardization Affect the Interpretation of the Data?¶
Standardization transforms the values of continuous variables so that they have a mean of 0 and a standard deviation of 1. This allows the model to treat all continuous variables equally, rather than allowing larger-scale variables to disproportionately affect the model"s predictions.
Without standardization, variables like Test Veh Displacement (L) (which might have values in the range of 1-10) and Set Coef A (lbf) (which could range much higher) would have vastly different scales. If left unstandardized, regression coefficients for larger-scale variables might dominate, even if they are not more important.
Why Is It Important to Standardize These Variables Before Performing Regression Analysis?¶
Equal Weighting of Variables:
- In models such as linear regression, the scale of the input variables can impact the estimated coefficients. Standardizing the variables ensures that all inputs are measured on the same scale, allowing for a fair comparison of their effects.
- This is especially important for variables with large differences in magnitude, such as displacement and emissions data. Without standardization, the model might give undue weight to larger variables, leading to biased results.
Interpretability:
- After standardization, regression coefficients represent the change in the dependent variable for a 1 standard deviation change in the predictor variable. This makes the interpretation of the coefficients easier, as all variables are compared in terms of their standard deviations rather than their raw units.
Improved Model Convergence:
- Standardizing data can improve the convergence speed of optimization algorithms used in regression models, such as gradient descent. Models often perform better and reach a solution more quickly when the data is standardized, because large variations in feature scales can slow down or hinder convergence.
Reduces Sensitivity to Initial Conditions:
- Many machine learning algorithms are sensitive to the scale of the input features. By standardizing, we reduce the impact of different initial conditions, which can help improve model performance and stability.
8. Temporal Analysis ¶
# Set the style for the plot
sns.set_theme(style="whitegrid")
# List of variables for temporal analysis
variables = {
"co2_g_mi": "CO2 Emissions (g/mi)",
"nox_g_mi": "NOx Emissions (g/mi)",
"test_veh_displacement_l": "Engine Displacement (L)",
"rated_horsepower": "Rated Horsepower (HP)",
}
# Function to plot temporal analysis for each variable
def plot_temporal_analysis(df: pd.DataFrame, variable: str, title: str):
plt.figure(figsize=(10, 6))
sns.lineplot(data=df, x="year", y=variable, marker="o")
plt.title(f"{title} Over Time (2015-2024)")
plt.xlabel("Year")
plt.ylabel(title)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Iterate over each variable and plot the trend over time
for var, var_name in variables.items():
plot_temporal_analysis(combined_df, var, var_name)
Temporal Analysis – Interpretation and Summary¶
In this step, we analyzed how CO2 emissions and other key variables have evolved over time (from 2015 to 2024). Below is a detailed interpretation of the results for CO2 emissions and other important variables such as NOx emissions, Test Veh Displacement (L), and Rated Horsepower.
1. CO2 Emissions Over Time¶
Noticeable Pattern:¶
- The chart shows significant fluctuations in CO2 emissions over time, instead of a steady decline. This indicates that various factors are influencing emissions patterns, with periods of both increases and decreases over the years.
Key Fluctuations:¶
2015-2017: A decline in CO2 emissions is observed between 2015 and 2017. This could be attributed to advancements in fuel efficiency technologies, such as turbocharging and the introduction of more efficient internal combustion engines, as well as increasing adoption of hybrid vehicles during this period.
2018-2020: From 2018 to 2020, there is a rise in CO2 emissions. This increase might be driven by the growing popularity of larger vehicles, such as SUVs and trucks, which generally have higher emissions. The rise in CO2 emissions during these years suggests a possible shift in consumer preferences toward larger vehicles that demand more fuel and produce more emissions.
2021-2022: The chart shows a sharp increase in emissions, peaking in 2022. This could be due to a continued trend in the sales of large, high-emission vehicles or temporary relaxations in regulatory enforcement.
2023-2024: A slight decline in emissions is seen between 2022 and 2024. This may reflect a renewed focus on emissions reduction, driven by the adoption of electric vehicles (EVs), stricter environmental regulations, or advancements in hybrid technologies.
Insight:¶
- The overall fluctuation in CO2 emissions suggests that while there have been periods of emissions reduction, consumer preferences, market trends (e.g., SUVs and trucks), and regulatory changes have led to variations in CO2 output over the years.
- The increase in emissions from 2018 to 2022 likely indicates that larger vehicles with higher emissions have been more prominent during this time. This highlights the ongoing challenge of balancing consumer demand for larger vehicles with the need to meet environmental standards.
- The decline after 2022 suggests that recent advances in vehicle technology and possibly EV adoption are starting to reduce emissions again.
2. NOx Emissions Over Time¶
Noticeable Pattern:¶
- The chart shows a sharp spike in NOx emissions in the year 2020, followed by a quick return to near-zero levels in the following years. This sudden increase and subsequent decline is unusual and likely points to an anomaly or a specific, isolated event.
Key Fluctuations:¶
2015-2019: NOx emissions are consistently low and almost flat during this period, suggesting that emissions control technologies (such as catalytic converters and diesel exhaust treatment systems) have been effective at keeping NOx emissions low.
2020: There is a significant spike in NOx emissions, which could indicate a temporary issue in emissions control during this period. This could be due to:
- A temporary regulatory relaxation or delay in adopting stricter emissions standards.
- An increase in diesel or heavy-duty vehicle production, which typically emit more NOx, or a specific batch of vehicles that had higher-than-usual NOx emissions.
- Technological issues or supply chain problems that delayed the implementation of effective emissions control systems in new vehicles.
2021-2024: NOx emissions return to near-zero levels following the spike in 2020. This suggests that the industry quickly corrected whatever caused the 2020 spike, whether through regulatory enforcement, technological improvements, or a shift in vehicle production trends.
Insight:¶
The overall trend in NOx emissions is generally positive, with consistently low emissions across most years except for the 2020 spike. This likely reflects the success of emissions control technologies and stringent regulations in limiting NOx emissions from vehicles.
The 2020 spike is likely an anomaly or the result of a specific event or market trend that temporarily led to higher NOx emissions. Further investigation into vehicle types or regulatory changes during this year might help explain the cause of this fluctuation.
3. Engine Displacement (L) Over Time¶
Noticeable Pattern:¶
- The chart shows fluctuations in engine displacement over time, with a decline from 2015 to 2018, followed by a sharp increase starting from 2021 and continuing into 2024. The engine displacement increased significantly, peaking in 2024 at over 4.2 liters on average.
Key Fluctuations:¶
2015-2018: A gradual decline in engine displacement is seen during this period, with the average engine size dropping from around 3.4L to 3.1L. This decline suggests a shift toward downsizing of engines, likely driven by the need for better fuel efficiency and compliance with emissions regulations.
2019-2020: The engine displacement remained relatively stable during this period, fluctuating around 3.2L, indicating that manufacturers may have stabilized engine sizes by focusing on fuel efficiency technologies like turbocharging or hybridization without further downsizing.
2021-2024: A sharp increase in engine displacement begins in 2021, climbing rapidly from around 3.2L to over 4.2L by 2024. This rise may indicate a shift back toward producing larger vehicles, such as SUVs and trucks, which typically have bigger engines. The increase could also reflect a market demand for higher performance or a reduction in focus on smaller, more efficient vehicles during this period.
Possible Explanations for the Increase:¶
Market Shift: The increase in engine size from 2021 to 2024 could be driven by a consumer preference shift back toward larger vehicles, such as trucks and SUVs, which are often equipped with bigger engines to provide more power and towing capacity.
Regulatory Changes: A relaxation of emissions or fuel economy standards may have allowed manufacturers to prioritize performance and vehicle size over efficiency, resulting in a return to larger engines.
Technological Advances: Although engine displacement is increasing, it's possible that new technologies such as hybrid powertrains or electric vehicle support systems are compensating for the increase in engine size, ensuring that larger engines do not have as significant an impact on fuel consumption or emissions as in the past.
Insight:¶
The sharp increase in engine displacement after 2021 suggests that consumer demand for larger, more powerful vehicles has returned, potentially driven by the popularity of SUVs and trucks. This trend could also reflect a shift in market strategy by manufacturers, prioritizing vehicle power and performance over the trend of downsizing seen earlier in the decade.
The earlier downsizing trend from 2015 to 2018 reflects the industry's response to regulatory pressures and the demand for fuel efficiency. However, the recent shift back to larger engines suggests a changing landscape, possibly due to different market forces or relaxed standards.
4. Rated Horsepower Over Time¶
Corrected Interpretation of Rated Horsepower (HP) Over Time (2015-2024)¶
Noticeable Pattern:¶
- The chart shows a steady increase in rated horsepower from 2019 onward, with a sharp rise starting in 2021. Rated horsepower remains relatively stable before 2019, with slight fluctuations between 270 and 290 HP, but rises to around 340 HP by 2024.
Key Fluctuations:¶
2015-2018: During this period, horsepower remains relatively stable, with slight fluctuations around 280 HP. This suggests that vehicle manufacturers were likely balancing power and efficiency during this time, possibly focusing on improving fuel economy and emissions while maintaining reasonable performance levels.
2019-2021: A notable upward trend begins around 2019, with horsepower increasing significantly. This may indicate a shift in the market toward higher-performance vehicles, possibly reflecting growing consumer demand for vehicles with more powerful engines, such as SUVs and trucks.
2022-2024: The chart shows a continued sharp increase in rated horsepower, reaching around 340 HP by 2024. This suggests a sustained demand for high-performance vehicles during this period. It could also indicate that advancements in engine technology allowed manufacturers to increase horsepower without sacrificing fuel efficiency or emissions performance.
Possible Explanations for the Increase:¶
Market Demand for Performance: The sharp rise in horsepower starting around 2020 suggests a growing consumer preference for larger, more powerful vehicles such as SUVs and trucks. These vehicles typically have higher horsepower ratings to accommodate greater weight, towing capacity, and off-road capabilities.
Technological Advancements: Improvements in engine technology, such as turbocharging and hybrid powertrains, may have enabled manufacturers to boost horsepower without compromising fuel efficiency or increasing emissions significantly. This would allow for both performance and compliance with regulatory standards.
Shift in Vehicle Types: The increase in horsepower could reflect a broader shift in vehicle production toward more powerful models, which might be favored in regions where fuel prices are lower, or where regulations permit larger vehicles with higher power outputs.
Insight:¶
The gradual increase in rated horsepower reflects a significant change in the automotive market, likely driven by consumer preferences for performance and larger vehicles. Despite regulatory pressures to improve fuel efficiency and reduce emissions, manufacturers appear to have found ways to increase power without negatively impacting these other factors.
This trend aligns with the increase in engine displacement observed during the same period, suggesting that the market is moving toward vehicles that provide more power and performance while incorporating advanced technologies to balance emissions and fuel efficiency.
9. Vehicle Manufacturer Analysis ¶
# Identify the top 10 manufacturers
top_manufacturers = combined_df["vehicle_manufacturer_name"].value_counts().head(10).index.tolist()
# Filter data for top manufacturers
top_mfr_df = combined_df[combined_df["vehicle_manufacturer_name"].isin(top_manufacturers)]
# Plot boxplot displacement across manufacturers
plt.figure(figsize=(12, 6))
sns.boxplot(x="vehicle_manufacturer_name", y="test_veh_displacement_l", data=top_mfr_df)
plt.title("Displacement by Top 10 Manufacturers")
plt.xlabel("Manufacturer")
plt.ylabel("Displacement (L)")
plt.xticks(rotation=45)
plt.show()
# Plot boxplot CO2 emissions across manufacturers
plt.figure(figsize=(12, 6))
sns.boxplot(x="vehicle_manufacturer_name", y="co2_g_mi", data=top_mfr_df)
plt.title("CO2 Emissions by Top 10 Manufacturers")
plt.xlabel("Manufacturer")
plt.ylabel("CO2 Emissions")
plt.xticks(rotation=45)
plt.show()
# Plot line graph for displacement across manufacturers
plt.figure(figsize=(12, 6))
sns.lineplot(x="year", y="test_veh_displacement_l", hue="vehicle_manufacturer_name", data=top_mfr_df)
plt.title("Displacement Over Years by Manufacturer")
plt.xlabel("Model Year")
plt.ylabel("Displacement (L)")
plt.legend(title="Manufacturer", bbox_to_anchor=(1, 1))
plt.show()
# Plot linear graphs for CO2 emissions over years by manufacturer
plt.figure(figsize=(12, 6))
sns.lineplot(x="year", y="co2_g_mi", hue="vehicle_manufacturer_name", data=top_mfr_df)
plt.title("CO2 Emissions Over Years by Manufacturer")
plt.xlabel("Model Year")
plt.ylabel("CO2 Emissions")
plt.legend(title="Manufacturer", bbox_to_anchor=(1, 1))
plt.show()
Interpretation of Manufacturer Performance in Terms of Displacement and CO2 Emissions¶
1. Displacement by Top 10 Manufacturers (Boxplot)¶
Key Observations:
- Most manufacturers have engine displacements between 2L and 5L, with some exceptions.
- FCA US LLC and GM show a wider range of engine displacements, with a few significant outliers. This likely reflects their production of larger vehicles, such as trucks and SUVs, which typically require larger engines.
- Mercedes-Benz and Volkswagen Group have relatively larger displacement engines compared to Honda, Hyundai, and Toyota, which likely focus more on producing smaller, fuel-efficient vehicles.
Outliers:
- There are extreme outliers in displacement for Nissan and GM, which could be due to specific high-displacement models, possibly commercial or performance vehicles.
Insights:
- FCA US LLC, GM, and Mercedes-Benz tend to produce vehicles with larger engines, which may indicate a focus on performance vehicles or larger models like SUVs and trucks.
- Honda, Toyota, and Hyundai appear to focus more on compact and fuel-efficient cars, as their engine displacements are consistently smaller across the range.
2. CO2 Emissions by Top 10 Manufacturers (Boxplot)¶
Key Observations:
- GM, FCA US LLC, and Volkswagen Group have higher CO2 emissions compared to other manufacturers. This is consistent with the trend of these companies producing larger, more powerful vehicles with larger engine displacements, which typically have higher fuel consumption and emissions.
- Honda and Hyundai have relatively lower CO2 emissions, reflecting their focus on smaller, more fuel-efficient vehicles.
Outliers:
- Many manufacturers show outliers in CO2 emissions, particularly GM, FCA US LLC, and VOlkswagen Group. These outliers could correspond to high-performance or large-displacement vehicles that are exceptions within the broader fleet of more efficient models.
Insights:
- Manufacturers like GM, FCA US LLC, and Volkswagen may face challenges in reducing CO2 emissions due to their production of larger, less fuel-efficient vehicles.
- On the other hand, manufacturers such as Honda, Toyota, and Hyundai demonstrate lower CO2 emissions, which aligns with their focus on producing fuel-efficient vehicles and likely puts them in a better position to comply with emission standards.
3. Displacement Over Years by Manufacturer (Line Plot)¶
Key Observation:¶
- Nissan stands out with an extremely large engine displacement starting around 2020, which increases significantly and reaches nearly 18 liters by 2024. This is a stark contrast to other manufacturers, whose average displacement remains below 5 liters.
Nissan's Displacement Spike:¶
- The sharp and consistent increase in Nissan’s engine displacement likely reflects the production of specific heavy-duty or specialized vehicles, such as trucks or commercial vehicles that require significantly larger engines for performance and towing capacity.
- Another possibility could be the introduction of a niche or high-performance model lineup that skews Nissan's average displacement upwards. However, this is an outlier when compared to the other manufacturers.
Other Manufacturers:¶
Manufacturers like BMW, FCA US LLC, GM, and Volkswagen Group exhibit relatively consistent engine displacement sizes, typically ranging between 3 and 5 liters. This suggests these companies produce a mix of mid-sized SUVs, trucks, and performance vehicles alongside more fuel-efficient models.
Honda, Toyota, and Hyundai show more stable and smaller displacements, maintaining engine sizes closer to 2 to 3 liters over time, indicating their focus on fuel-efficient and compact vehicles.
Insights:¶
Nissan's drastic increase in engine displacement from 2020 onwards could indicate a specific strategy or product line that focuses on larger, more powerful engines, likely intended for heavy-duty applications or niche markets.
Most other manufacturers appear to be maintaining relatively stable and smaller engine sizes, in line with trends for producing fuel-efficient vehicles and balancing performance with environmental regulations.
The sharp contrast between Nissan's displacement values and those of other manufacturers suggests that Nissan's product mix during these years heavily emphasizes large-engine vehicles, which could either be a deliberate market strategy or the result of a specific new model range.
4. CO2 Emissions Over Years by Manufacturer (Line Plot)¶
Key Observations:
- Volkswagen Group, GM, and FCA US LLC exhibit higher CO2 emissions over time, with GM showing a sharp rise around 2022. This may be due to their production of larger vehicles or a relaxation in their focus on fuel efficiency.
- Honda, Hyundai, and Toyota show relatively lower CO2 emissions throughout the period, with some fluctuation but no significant increase.
Insights:
- GM and Volkswagen Group appear to face challenges in reducing CO2 emissions, likely due to their production of larger, less fuel-efficient vehicles. As regulations around emissions become stricter, these manufacturers may need to invest more in fuel-efficient technologies or alternative powertrains.
- Manufacturers like Honda and Hyundai, who have maintained lower CO2 emissions, are likely benefiting from their focus on smaller vehicles and fuel-efficient technologies, positioning themselves well for emissions regulations and consumer demand for environmentally friendly cars.
10. Creating a Clean Dataset for Regression ¶
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import KNNImputer
# List of columns to drop
columns_to_drop = [
"test_vehicle_id", "test_number", "adfe_test_number", "year", "actual_tested_testgroup",
"drive_system_description", "transmission_overdrive_desc", "shift_indicator_light_use_desc",
"test_originator", "analytically_derived_fe", "test_procedure_description",
"test_fuel_type_description", "aftertreatment_device_desc", "averaging_group_id",
"averging_method_desc", "averaging_method_cd", "veh_mfr_code",
"test_veh_configuration_#", "fe_bag_4", "police_emergency_vehicle",
"#_of_cylinders_and_rotors", "#_of_gears", "engine_code",
"aftertreatment_device_cd", "fe_unit", "averaging_weighting_factor",
"represented_test_veh_make", "transmission_overdrive_code",
"adfe_total_road_load_hp", "adfe_equiv_test_weight_lbs",
"adfe_n_v_ratio", "pm_g_mi"
]
# List of numerical columns after excluding the columns_to_drop
numeric_cols = [col for col in numerical_cols if col not in columns_to_drop]
# Make copy of the combined dataframe
cleaned_df = standartized_df.copy()
print("Shape of the dataset before dropping unnecessary columns:", cleaned_df.shape)
# Drop unnecessary columns
cleaned_df = cleaned_df.drop(columns=columns_to_drop, axis=1)
print("Shape of the dataset after dropping unnecessary columns:", cleaned_df.shape)
print("Missing values in the dataset before imputation:", cleaned_df.isnull().sum().sum())
# Handle missing values using KNNImputer
knn_imputer = KNNImputer(n_neighbors=5)
# Fit and transform the imputer on numerical columns only
cleaned_df[numeric_cols] = knn_imputer.fit_transform(cleaned_df[numeric_cols])
print("Missing values in the dataset after imputation:", cleaned_df.isnull().sum().sum())
print("Total rows before removing outliers:", cleaned_df.shape[0])
# Use IsolationForest for outlier detection
isolation_forest = IsolationForest(contamination=0.05, random_state=42) # 5% of the data is assumed to be outliers
outliers = isolation_forest.fit_predict(cleaned_df[numeric_cols])
# Remove rows identified as outliers (marked with -1 by IsolationForest)
cleaned_df: pd.DataFrame = cleaned_df[outliers == 1]
print("Total rows after removing outliers:", cleaned_df.shape[0])
print("Percentage of data kept after removing outliers:", (cleaned_df.shape[0] / standartized_df.shape[0])*100)
# Handle categorical variables
# Use OneHotEncoder to convert categorical columns to dummy/indicator variables
one_hot_encoder = OneHotEncoder(drop="first", sparse_output=False)
# Fit and transform categorical columns
encoded_categorical_data = one_hot_encoder.fit_transform(cleaned_df[categorical_cols])
# Convert the encoded categorical columns into a DataFrame
encoded_df = pd.DataFrame(encoded_categorical_data, columns=one_hot_encoder.get_feature_names_out(categorical_cols))
# Drop the original categorical columns from the dataset and concatenate the encoded columns
cleaned_df = cleaned_df.drop(columns=categorical_cols, axis=1)
# Concatenate the original numerical columns with the newly encoded categorical columns
cleaned_df = pd.concat([cleaned_df.reset_index(drop=True), encoded_df.reset_index(drop=True)], axis=1)
print("Shape of the final cleaned dataset:", cleaned_df.shape)
display(cleaned_df.head())
# Save the clean dataset for regression
cleaned_df.to_csv("vehicles_data.csv", index=False)
Shape of the dataset before dropping unnecessary columns: (45479, 68) Shape of the dataset after dropping unnecessary columns: (45479, 36) Missing values in the dataset before imputation: 159262 Missing values in the dataset after imputation: 0 Total rows before removing outliers: 45479 Total rows after removing outliers: 43206 Percentage of data kept after removing outliers: 95.00208887618462 Shape of the final cleaned dataset: (43206, 55)
| model_year | vehicle_manufacturer_name | represented_test_veh_model | test_veh_displacement_l | rated_horsepower | equivalent_test_weight_lbs | axle_ratio | n_v_ratio | shift_indicator_light_use_cd | test_procedure_cd | ... | tested_transmission_type_Continuously Variable | tested_transmission_type_Manual | tested_transmission_type_Other | tested_transmission_type_Selectable Continuously Variable (e.g. CVT with paddles) | tested_transmission_type_Semi-Automatic | transmission_lockup_Y | test_category_FTP | test_category_HWY | test_category_SC03 | test_category_US06 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | aston martin | DB9 | 0.349398 | 1.399203 | 0.120612 | -0.135910 | -0.042358 | -0.182084 | -0.163751 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2015 | aston martin | DB9 | 0.349398 | 1.399203 | 0.120612 | -0.135910 | -0.042358 | -0.182084 | -0.750387 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 2015 | aston martin | Rapide S | 0.363321 | 1.676468 | 0.411694 | -0.768489 | -0.229117 | -0.182084 | -0.750387 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 2015 | aston martin | V12 Vantage S | 0.363321 | 1.762288 | -0.170471 | 0.098058 | 0.014104 | 1.812381 | -0.163751 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 2015 | aston martin | V12 Vantage S | 0.363321 | 1.762288 | -0.170471 | 0.098058 | 0.014104 | 1.812381 | -0.750387 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
5 rows × 55 columns
Summary of Cleaning Process:¶
In the last step, several important data transformations were performed to clean the dataset for regression analysis. These include:
Dropping Irrelevant Columns:
- Columns that were not relevant to the regression analysis, were dropped to focus on the key variables related to emissions, fuel efficiency, and vehicle characteristics.
Imputing Missing Values:
- Missing values in numerical columns were handled using the KNN Imputer. The imputer fills in missing values by considering the nearest neighbors in the dataset based on similarity in feature space, ensuring that the imputed values maintain consistency with the existing data patterns.
Outlier Detection and Removal:
- Outliers were detected using the IsolationForest method, which is designed to identify anomalous data points that deviate from the general distribution. This approach uses an unsupervised learning algorithm that isolates outliers by building decision trees, where fewer splits indicate anomalies. Outliers, marked with a value of -1 by IsolationForest, were removed to improve the accuracy and robustness of the regression model. Removed 5% of the data that were considered anomalous, preventing extreme values from distorting the regression analysis.
Encoding Categorical Variables:
- Categorical variables were encoded using OneHotEncoder, converting them into binary columns. This process transforms each categorical feature into a set of dummy variables, where each unique category is represented as a separate binary feature. It ensures that categorical data is properly handled and can be used effectively in the regression model.